We are experiencing delays in "Your Service Call has been created" e-mails being sent to callers.
I have traced the issue back to the application servers (we have two, this appears to be affecting one box 80% of the time and the other box 20% of the time) and in particular two database rules.
The setup is as follows. When a Service Call is created, two rules fire:
1) An e-mail is sent to the caller notifying them of creation
2) A boolean field is set to 'No'
I have noticed that in some cases a call is created at 09:00 and the boolean field is not set to 'No' until 11:00 - a 2 hour delay. This appears to be sporadic. I have rebooted all of the application servers, which has not solved the problem.
A similar database rule (as described above) is initiated when a Service Call is created and sends an e-mail to the caller notifying them that the Service Call was created. I think this issue is affecting both of these "When a Service Call is created" rules.
Does anyone know what may be causing this, and what I can do to troubleshoot the issue? There are no scheduling validations on the rules and they are both active.
Q. in general do you have a lot of scheduled tasks running on either app server at the time of the delays?
A. We have two application servers, both running two instances of Service Desk (Ports 30998 and 30999). The logs show the following numbers of scheduled tasks:
SD1a: 2250
SD1b: 600
SD2a: 396
SD2b: 704
(hmm, SD1a seems a bit high?)
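To put a number on that imbalance, here is a minimal Python sketch that works out each instance's share of the scheduled tasks from the four counts above. The function name `task_share` is made up for illustration, and nothing here reads the Service Desk logs directly; the counts are typed in by hand.

```python
def task_share(counts):
    """Return each instance's share of all scheduled tasks, as a percentage."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


# Scheduled task counts as reported in the logs above.
counts = {"SD1a": 2250, "SD1b": 600, "SD2a": 396, "SD2b": 704}
shares = task_share(counts)

# Print the busiest instance first.
for name, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {pct}%")
```

On these figures SD1a carries roughly 57% of all scheduled tasks, which is more than the other three instances combined.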
Q. Are you running the OVSD app servers with extended memory settings?
Both machines have 4.0GB RAM and have a 2.0GB page file allocated (see screenshot).
Q. Are there any errors showing in the log?
There are no error messages whatsoever in the logs. However, if a call is created at 10:00 the "Your Service Call has been created" e-mail shows in the logs as being processed at 12:00 (i.e. the delays are not on the Exchange server).
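If it helps to see which calls are affected, the comparison between creation time and mail-processing time could be sketched as below. This assumes you can pull the two timestamps per call out of the logs yourself; the call IDs, the sample records and the 5 minute threshold are all hypothetical.

```python
from datetime import datetime, timedelta


def delayed_calls(records, threshold=timedelta(minutes=5)):
    """Return (call_id, delay) for each call whose mail was processed
    more than `threshold` after the call was created."""
    late = []
    for call_id, created, processed in records:
        delay = processed - created
        if delay > threshold:
            late.append((call_id, delay))
    return late


# Hypothetical sample data: (call id, created, mail processed).
records = [
    ("SC-1", datetime(2007, 5, 3, 10, 0), datetime(2007, 5, 3, 12, 0)),
    ("SC-2", datetime(2007, 5, 3, 10, 5), datetime(2007, 5, 3, 10, 6)),
]
late = delayed_calls(records)
for call_id, delay in late:
    print(f"{call_id} delayed by {delay}")
```

Running this per server instance would show whether the late calls cluster on one instance.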
Q. Do you have any other applications integrated with OVSD e.g. OVO that is set to create alarms automatically?
Justin - we've not established exactly how many clients are connected at peak times. You said they were pretty evenly balanced, but how many are there? With more than 40 per App Server instance I would be looking to bring in some more.
I don't think the number of Scheduled Tasks is high enough to cause problems, although the imbalance is intriguing. Do some go back a long time, perhaps to when there were fewer App. Servers?
It was exactly these sorts of delays that first led us down the route of multiple App. Servers per box, we now have 4 boxes with 2 server instances per box (we've never needed to go higher).
I guess next we'd be looking at the sheer number of rules to fire? Are there other signs that the rule queue is slow to process (or conversely, do other rules seem OK?). Have you tried promoting a "mail rule" to head of queue to see if any difference is observed? Is it possible that you have HTTP POST rules attempting to get data from web-sites which might themselves slow down? (I believe use of this facility has to be handled with GREAT CARE since there is evidence that the rule manager waits synchronously for a reply).
Thank you for your reply. At peak times I have approx 40 users logged on spread out over 4 server instances (2 per box).
In general, the rule queue is fine - the only two rules that appear to be affected are the field that is set to 'No' when a Service Call is created and the mail send rule. I'm not sure how to prioritise mail rules over others?
We don't have any rules which retrieve data from other sources, so don't think the HTTP POST issue would be relevant.
As an aside, I restarted the problematic server this morning and the e-mails began to send instantaneously. This was short-lived, however, and the e-mails and DB rule have slowed down again.
Looks like I have narrowed it down to the SD1a server instance. This is the one with the high number of scheduled tasks.
I restarted the service a moment ago and noticed the following error in the logs:
Thu, 03/05/2007 10:40:02 Service: HTTP Service on port 30980 disabled. Error: Address already in use: JVM_Bind
Thu, 03/05/2007 10:40:02 Metrics cannot be collected, port 6001 is in use
Thu, 03/05/2007 10:40:02 Starting agent manager on dubsd1.meteor2000.ie
Could this have anything to do with the delayed e-mails? Error appears to relate to the Java Virtual Machine.
mmmm! So there might be something causing a "delay", we just need to find it!
The HTTP error makes me wonder if your app server settings are correct? You know you said you had ports 30999 and 30998 for the ITP protocol - what did you do about HTTP and SMTP? On my "dual" servers I have HTTP and SMTP disabled on the second instance, your error message might suggest that HTTP is enabled on both and on the same port? Please also check SMTP.
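A quick way to check for this kind of clash is sketched below in Python. The first function looks for a port claimed by more than one instance in a settings table you fill in by hand; the second asks the operating system whether anything is already listening on a port. The instance-to-port layout shown is an assumption for illustration only (the thread does not say which instance uses which port), and the function names are made up.

```python
import socket
from collections import defaultdict


def find_port_conflicts(instances):
    """Return {port: [(instance, service), ...]} for every port that is
    claimed by more than one service across the given instances."""
    claims = defaultdict(list)
    for name, services in instances.items():
        for service, port in services.items():
            if port is not None:  # None means the service is disabled
                claims[port].append((name, service))
    return {port: users for port, users in claims.items() if len(users) > 1}


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0


# Hypothetical settings for two instances on one box: HTTP enabled on both.
instances = {
    "SD1a": {"itp": 30999, "http": 30980, "smtp": None},
    "SD1b": {"itp": 30998, "http": 30980, "smtp": None},
}
conflicts = find_port_conflicts(instances)
for port, users in conflicts.items():
    print(f"port {port} is claimed by {users}")
```

Calling `is_listening("localhost", 30980)` on the server before starting the second instance would tell you whether the first instance already holds the HTTP port.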
I don't see how this would cause the "mail delay" we are seeing though!
I have narrowed the issue down to a custom web application we use to create calls (via the Web API). This web app is hard coded to connect to one Service Desk Server instance which was causing a high volume of scheduled tasks on one server instance alone.
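One way to stop the web app piling everything onto a single instance would be to rotate through all four instances instead of hard coding one. The sketch below only picks the host and port for each new connection; how the web app actually opens its Web API session is not shown in this thread, so that part is left out. The host names are hypothetical, the ports are the two ITP ports mentioned earlier, and the class name is made up.

```python
from itertools import cycle


class EndpointRotator:
    """Hand out server endpoints in round-robin order."""

    def __init__(self, endpoints):
        endpoints = list(endpoints)
        if not endpoints:
            raise ValueError("at least one endpoint is required")
        self._cycle = cycle(endpoints)

    def next_endpoint(self):
        """Return the next (host, port) pair in the rotation."""
        return next(self._cycle)


# Hypothetical host names; two instances per box.
endpoints = [
    ("sd1.example", 30999),
    ("sd1.example", 30998),
    ("sd2.example", 30999),
    ("sd2.example", 30998),
]
rotator = EndpointRotator(endpoints)

# Eight new connections are spread evenly, two per instance.
picked = [rotator.next_endpoint() for _ in range(8)]
for host, port in picked:
    print(f"connect to {host}:{port}")
```

Round-robin is the simplest choice here; it ignores how busy each instance already is, so it only evens out the load that the web app itself creates.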
If it's OK I'd like to keep this thread open for a while as I have a few questions relating to load balancing I'd like to bring up, when I have some time to write them down.