>> Correlating Events to Recognize Problems
Every engineer and manager who receives alerts from automated monitoring systems can relate both to the critical need they fill and to their often annoying shortcomings. I’m not just referring to the situation where monitoring has recently been installed and you haven’t yet tuned the default thresholds to limit notifications to actionable events. The nature of monitoring and alerting, even with the most sophisticated programs, is that problem events are based on very narrow criteria, such as a server’s CPU load, a router’s bandwidth consumption, or an application’s .NET errors. It’s good to know about these specific problems. But if they are symptoms of a single complex problem, that diagnosis can easily be missed when the separate events are buried among hundreds or thousands of unrelated problem events. Recognizing the complex problem is further complicated because the events come from different sources: servers, network appliances, and applications. Most real-world applications depend on distributed environments for reliable service delivery, so problems occur whose symptoms span multiple devices and programs. If you want to get ahead of the curve and recognize and fix complex problems as soon as they occur, you need to start correlating multiple events so that you can send intelligent notifications that describe the conditions and fixes for those problems.
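To make the idea of correlation concrete, here is a minimal sketch in Python. It is not taken from any particular monitoring product. The `Event` fields, the event kind names (`cpu_load`, `bandwidth`, `dotnet_error`), and the five-minute window are all assumptions chosen for illustration. The rule it implements is simple: if every required kind of event shows up within the same time window, report them together as one complex problem.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    """A single problem event (hypothetical shape, for illustration only)."""
    timestamp: float  # seconds since some reference point
    source: str       # e.g. "server", "network", "application"
    device: str       # e.g. "web01", "router1"
    kind: str         # e.g. "cpu_load", "bandwidth", "dotnet_error"


def correlate(events: Iterable[Event],
              required_kinds: Iterable[str],
              window_seconds: float = 300.0) -> List[List[Event]]:
    """Group events when every required kind occurs within one time window.

    Each returned group holds one event per required kind, in time order.
    An event is used in at most one group.
    """
    required = frozenset(required_kinds)
    ordered = sorted(events, key=lambda e: e.timestamp)
    used = set()   # indexes of events already assigned to a group
    matches = []

    for i, first in enumerate(ordered):
        if i in used or first.kind not in required:
            continue
        group = {first.kind: i}  # kind -> index of the event chosen for it
        for j in range(i + 1, len(ordered)):
            later = ordered[j]
            if later.timestamp - first.timestamp > window_seconds:
                break  # everything after this is outside the window
            if j not in used and later.kind in required and later.kind not in group:
                group[later.kind] = j
        if len(group) == len(required):
            used.update(group.values())
            matches.append([ordered[j] for j in sorted(group.values())])

    return matches


if __name__ == "__main__":
    sample = [
        Event(0.0, "server", "web01", "cpu_load"),
        Event(30.0, "server", "db02", "disk_full"),       # unrelated noise
        Event(60.0, "network", "router1", "bandwidth"),
        Event(120.0, "application", "orders", "dotnet_error"),
    ]
    for group in correlate(sample, {"cpu_load", "bandwidth", "dotnet_error"}):
        devices = ", ".join(f"{e.device}:{e.kind}" for e in group)
        print(f"Possible complex problem: {devices}")
```

A real correlation rule would likely also check that the events belong to the same service or dependency chain, which this sketch leaves out. The point is only that the three narrow events produce one notification instead of three.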