PolyMon Executive Unable to Keep Up with Load

May 16, 2007 at 12:23 AM
I have been playing with this against 40+ boxes. It seems to have slowed down and is having problems keeping up. I have about 400 monitors that are set to trigger once a minute. Based on the following query there can be up to 180 seconds difference from the oldest entry to the newest.

I have both SQL 2005 and the PolyMon Executive running on a 2 x Xeon 2.4 Ghz system with 4 GB of RAM. It has other stuff running on it, but they are almost entirely inactive. It has a 3ware controller for disk and about 15 disks. I'm running both SQL 2005 and PolyMon Executive on this. It should be able to handle this type of load without problem.

	min(mcs.EventDT) as 'OldestEntry', 
	max(mcs.EventDT) as 'NewestEntry', 
	datediff(second, min(mcs.EventDT), max(mcs.EventDT)) as 'DiffSec',
	count(*) as 'Enabled1MinuteMonitors'
	dbo.MonitorCurrentStatus mcs with (nolock)
	join dbo.Monitor m with (nolock)
	on m.MonitorID = mcs.MonitorID
	m.TriggerMod = 1 
	and m.IsEnabled = 1
Here are my results:
OldestEntry NewestEntry DiffSec Enabled1MinuteMonitors
2007-05-15 16:23:13.000 2007-05-15 16:26:12.193 179 388

The processor on the box is averaging about 9%( peaks at about 20%), disk transfers per second is around 3, it's transfering about 15 KBps to disk, and there isn't any blocking in SQL. I know I could adjust my TriggerMod to 5 minutes, but most of these I'd like to see more often. It looks like at least the piece connecting to SQL is single threaded, since I've watched and I only see a single connection doing all the work. Based on that I did a capture of a TriggerMod schedule. The duration for all the stored procs starting with polymon_ins was 41.7 seconds for 388 monitors. The SQL Server could definitely handle more load, but unfortunately not serially. Either the stored procs have to be made faster or the inserts have to be made in parallel. The best solution of coarse would be PolyMon to be multi-threaded in processing the monitors, multi-threaded when adding to SQL, and the stored procs better performing.

I know it's a lot in one email, but it all seemed to be related. Thanks for your work on PolyMon. Currently it is really good for monitoring smaller environment and I can see potential for much larger ones.


May 16, 2007 at 5:17 AM
The ability to multi-thread the monitoring service is already planned. See Roadmap for more details.
In fact, the idea is not only to multi-thread the monitoring service but also to allow multiple services running on multiple machines being responsible for their own monitors since multi-threading by itself is still not sufficiently scalable.

We have been using polymon with about 150 monitors succesfully - on very very low-end hardware (but we do have the executive and the database on separate machines). Sure it's not "real time" data the way PerfMon would be , but it's not meant to be. We do run the main timer on a 1 minute interval, but we do not need to run every single monitor every minute - some run every minute, some run every 12 hours - depends on what we are monitoring and the level of granularity we are looking for in counter information. Used this way we have not encountered e severe performance degradation you seem to be having.

One point regarding your timing exercise, I do not believe it is reflective of a SQl bottleneck. At this point, making the inserts more efficient is not going to help any - your timings are capturing the entire time, including that spent by the service running the monitors as well as storing the data. If SQL Server is not capabale of handling 400 inserts per minute (and the inserts are probably as small as any OLTP system) then maybe the Oracle folks are on to something :-)

The main culprit is the service itself, not SQL. As you righly point out, it needs to be multi-threaded (and in fact needs to scale out, not just up).

As far as SQL performance goes, the largest bottleneck in the inserts is going to be the triggers that update the aggregate tables. If you can live without aggregate tables that are updated real-time, then I would disable the triggers and replace with a periodic aggregate table update (I believe the SP to do so is already included) that you could run every 24 hours.

I would still be happy to hear your suggestions for improving the sp efficiency - can never be too efficient. But my development time at this point will be better spent on the larger bottlenecks caused by the lack of multi-threading in the monitoring service.
May 17, 2007 at 5:57 PM
Road map looks pretty good, I'm glad it's moving to multiple PolyMon Executives running at the same time.

Looking at my timing again on SQL, it looks like I found a large part of what was slowing it down. There were 2 places I made minor changes and the I went from 3 minutes to run the monitors to just under 2 minutes. The first was in the MonitorEvent.UpDownTimeCalculation trigger and the second was in the polymoninsEvaluateEventAlertStatus stored procedure. Before it was doing 18,000 reads per execution to around 100 reads. It also went from a about 100 ms to execute to about 25 ms.

The change in the trigger was to comment out the call to fn_LifetimePercUptime. This is doing a scan of all entries for a monitor in MonitorEvent. This call was itself doing 17,000+ reads in my database. Since this isn't a critical statistic for me I just set it to 0 for everything.

The change I made to the stored proc involved determining the last Alert EventID. It was doing a join between MonitorEvent and MonitorAlert even though all the data it was using seems to already be in MonitorAlert.

I took this:
select top 1 @LastAlertEventID = MonitorAlert.EventID
from MonitorEvent
	inner join MonitorAlert on MonitorEvent.EventID=MonitorAlert.EventID
where MonitorEvent.MonitorID=@MonitorID
order by MonitorEvent.EventDT desc

and changed it to this:
select top 1 @LastAlertEventID = EventID
from dbo.MonitorAlert
where MonitorID=@MonitorID
order by EventDT desc
And then added an index for it.
create index idx_MonitorAlert_MonitorID_EventDT_EventID on dbo.MonitorAlert
( MonitorID, EventDT, EventID) 

The effect of these changes was to lower the CPU usage and improve the execution time. The first change obviously isn't something you would want to use since it would involve turning off functionality for PolyMon, although the function probably wants to be looked at. The second change plus the index should be able to be used in the code, unless I misunderstood something about the join.


May 21, 2007 at 1:40 PM

You are right - upon looking again at the EvaliateEventAlertStatus stored proc, there does not seem to be any reason why it would need an inner jon to the MonitorEvent table - I will change this for the next release.
You are also right that the Lifetime % Uptime calculation can cause high utilization - the function can be modified to update the % calculation - I should be able to do this without re-reading all the individual monitor events (and in fact, because of the recent changes that added data retention schemes, that calculation would in any event be incorrect once base monitor events are deleted). I will also look into this for the next release.

Thanks again - if you have any more suggestions they would be most welcome.
May 21, 2007 at 1:42 PM
This discussion has been copied to a work item. Click here to go to the work item and continue the discussion.