We have six Windows Server 2003 (W2K3) servers, each with 1.5 GB of memory and a 2.8 GHz processor, running the app servers, and one Linux server of the same spec running an Oracle 9.2 DB.
Almost every day for the last week, and intermittently for the last few weeks, one or two of the application servers hit over 80% CPU (normal is under 10%). This freezes anyone logged on to that server, stops anyone else from logging on, and eventually brings down the entire Service Desk service.
It is the sd_serverservice.exe process that is using all the CPU.
I originally thought that this was just one of those things, as it only happened once a week. Then we experienced an issue with NetBackup and thought that this was the cause. That is now fixed, and the issue with the app servers is getting worse.
Has anyone experienced this themselves?
Is there any way of seeing what is going on with the Service Desk service?
I have turned rule debugging on but this does not show anything. My scheduled tasks number around 5000, which I am led to believe is normal.
I have run the rule debug sent to me a while ago by HP, but its formatting is simply useless.
Do you have data exchange tasks running? Does the logserver.txt file show any errors? How many users do you have accessing the system (approx)? What is the weighting of the application servers, or are they on a 1:1 ratio? Do you have a lot of users accessing OVSD at the same time, e.g. users logging on from one location? Does it look like memory is also being taken up on the server(s)? The default memory allocations are generally insufficient and can be changed.
Does the logserver.txt file show any errors? Nothing at all, even with rule debug on.
How many users do you have accessing the system (approx)? During the night, about 10 per server; during the day, around 50 per server.
What is the weighting of the application servers or are they on a 1:1 ratio? All equal.
Do you have a lot of users accessing OVSD at the same time, e.g. users logging on from one location? Yes, that would be at 8.00am GMT, but there is no correlation between time of day and the issue.
Does it look like memory is also being taken up on the server(s)? The default memory allocations are generally insufficient and can be changed. Memory appears to be fine during this time. I have allocated 1 GB to the JVM.
At the time when the process spikes over 80%, can you see that the process (sd_serverservice.exe) is using the full gigabyte of memory?
When this occurs, there are a couple of things you can do to see what happens. You can turn on rule manager logging by executing this from the server/bin directory:
sd_servermanager.bat /monitor [servername] [port] com.hp.ifc.ev.dbrules.AppDBRuleManager setMonitoring true
Setting the last argument to false will turn off the rule manager logging. I think that this probably has the same information in it as turning on rule debugging through the admin panel.
You can also view the status of the threads that are created by the process. The easiest way to do this is to start the server using the batch file, which lets you see the server console. On the server console, if you click the button named "log monitoring information", the thread information will be logged to logserver.txt. This lets you see which threads are active.
Do the problem app servers have a specific task, e.g. database reporting, service pages, or login only? Look for external events that may create additional load on the app servers. Do you get a lot of "unable to connect", "connection reset by peer", or "cannot connect to mail server" errors, or any other error messages that may indicate a problem with networking? There is a known problem where, if an app server cannot connect to a mail server within a certain period, it pushes Service Desk to the max and crashes the system, but I can't remember whether it is pre or post SP9. Have a look at self solve. You can turn on the app server GUI, which shows the performance of your server, threads, queues, etc. This can be turned on via the admin console, the system panel, or the -monitor parameter. If you use -monitor for the service, you will need to turn on the 'interact with desktop' option from services.msc or nothing will happen.
If there is only one "connection reset by peer", that is quite normal: a timeout occurs and the connection is shut down because there is no response on the other end. Do you have integration happening with the problem server? Multiple sd_events coming in, perhaps? You can run the following to see how big your system has grown:

select ent_name, count(*)
from rep_javaobjects,
     ifc_entities
where ent_oid = jav_entity
group by ent_name;

But I would have thought 6 app servers would be sufficient. I found the details of some possibilities that may cause your symptoms:

ITSM005939: email server not responding
OV-ENSD42939: Oracle tablespace running out of space
OV-EN016405: load balancing not working from SP6

If you have ruled out everything else, you might want to consider adding a 7th app server.
The number of scheduled tasks that you are running should not be a problem. We have approximately 3000 tasks per server (4 servers) with no issue and we have similar hardware.
Can you see any correlating events in the log? Since the server seems to disallow any further connections, would it be safe to assume that the last entry in the log was the problem? Maybe it's that email bug?
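If you want to check for correlating events without eyeballing the whole file, here is a rough Python sketch. It assumes you have already pulled the log entries into (timestamp, message) pairs yourself; the sample entries, the `entries_near` name and the 5 minute window are all made up for illustration and are not the real logserver.txt format.

```python
from datetime import datetime, timedelta

def entries_near(entries, spike_time, window_minutes=5):
    """Return the (timestamp, message) pairs within window_minutes of spike_time.

    entries is a hypothetical, already-parsed list; this sketch does not
    attempt to parse logserver.txt itself.
    """
    window = timedelta(minutes=window_minutes)
    return [(ts, msg) for ts, msg in entries if abs(ts - spike_time) <= window]

# Invented sample data, for illustration only.
entries = [
    (datetime(2024, 1, 10, 2, 10), "scheduled task completed"),
    (datetime(2024, 1, 10, 2, 28), "flush error"),
    (datetime(2024, 1, 10, 3, 15), "user logged on"),
]
spike = datetime(2024, 1, 10, 2, 30)

nearby = entries_near(entries, spike)
for ts, msg in nearby:
    print(ts.isoformat(), msg)
```

Widening `window_minutes` is the obvious knob if nothing shows up close to the spike.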
How big do your log files get? Could the logs be full?
I would also check the load balancing to make sure people are dividing evenly across the servers, by looking at who is logged in on the administration panel. You should be able to sort by server and take a rough count. We had to set all of our server weight ratios to 350 (instead of 1) due to a problem in load balancing we encountered.
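The rough count can be done by hand, but if you copy the logged-in list out of the administration panel, something like this Python sketch would tally it. The (user, server) pairs, the server names and the function names are assumptions for illustration; how you export the list is up to you.

```python
from collections import Counter

def sessions_per_server(sessions):
    """Count logged-in users per application server.

    sessions is a hypothetical list of (user, server) pairs copied from
    the administration panel.
    """
    return Counter(server for _user, server in sessions)

def max_differential(counts):
    """Difference between the busiest and the quietest server."""
    if not counts:
        return 0
    return max(counts.values()) - min(counts.values())

# Invented sample data, for illustration only.
sessions = [
    ("alice", "app1"), ("bob", "app1"),
    ("carol", "app2"),
    ("dave", "app3"), ("erin", "app3"), ("frank", "app3"),
]

counts = sessions_per_server(sessions)
print(dict(counts))
print("differential:", max_differential(counts))
```

A small differential suggests the load balancing is doing its job. Note that a server with nobody logged on will not appear in the counts at all, so check that every server is listed.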
Also, when you look at the sd_serverservice task when it is at 100%, how much memory is it using?
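If you can log CPU and memory for the process at intervals (with whatever monitoring tool you have), a sketch like the one below would pull out just the samples taken during a spike, so you can see whether memory is anywhere near the 1 GB JVM allocation at that moment. The `Sample` fields and the numbers are invented for illustration; only the 80% threshold comes from this thread.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    time: str           # time of the reading, e.g. "02:30" (hypothetical format)
    cpu_percent: float  # CPU used by sd_serverservice.exe
    memory_mb: float    # memory used by sd_serverservice.exe

def samples_during_spikes(samples, cpu_threshold=80.0):
    """Keep only the samples where CPU is over the threshold."""
    return [s for s in samples if s.cpu_percent > cpu_threshold]

# Invented readings, for illustration only.
samples = [
    Sample("02:25", 8.0, 310.0),
    Sample("02:30", 97.0, 315.0),
    Sample("02:35", 99.0, 318.0),
    Sample("02:40", 9.0, 312.0),
]

spikes = samples_during_spikes(samples)
for s in spikes:
    print(s.time, f"cpu={s.cpu_percent}%", f"mem={s.memory_mb} MB")
```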
What java version do you use on the server?
I know that these ideas aren't particularly helpful, but they are good things to eliminate and may help uncover the issue.
Also, not that this is a cause, but Windows Server 2003 isn't officially supported by HP Service Desk Service Pack 9. It becomes supported in one of the service packs higher than 17, if memory serves. I don't think that this would present a problem, but it's something to note.
Tim is right, your tasks and rules look reasonable. The question seems to be why your SD service is working hard at certain times. At 2:30 am it should be a low period, unless this is the time when some user runs a big report, multiple updates, DB maintenance or something. Is there any pattern to the 100% utilizations? Have you spoken to the users, if any, around the 2:30 am peak as to what they were up to? I was thinking it might be a Java garbage collection issue, since the CPU does spike when the JVM tries to recover memory, but then you would be seeing more spikes than you currently are.
Thanks for the information, guys. HP sent me the hotfix for the SMTP server issue yesterday. Even though there was no evidence that it was the issue, we applied it as a process of elimination, and guess what: no spikes last night! So hopefully we may have the fix!
In answer to your questions, though:
Can you see any correlating events in the log? No, this has been the most frustrating thing: no errors (well, none out of the ordinary) that tie in with the time of the spike, and this is with debug on as well. The only error that is consistent is the "flush error" when the CPU is maxed out.
How big do your log files get? Could the logs be full? We have Patrol monitoring that interrogates, segments and archives the logs every few hundred K.
I would also check the load balancing: It never moves out of around a 2 or 3 differential between servers.
Also, when you look at the sd_serverservice task when it is at 100%, how much memory is it using? Barely any. We have had memory issues in the past and this was my first port of call.