Visit Heroix at http://www.heroix.com
Charting Life in the IT Environment

>> Plan IT Coverage and Systems Monitoring for when you are Gone

by Dave Atkins on June 24, 2009

When you set up monitoring of servers, services, networks, and applications, it is easy to forget why certain tests are in place, or to assume you will remember what to do about certain problems. You should invest the time to plan out how to monitor the functionality of applications, not just up/down status, and you should think about how you document that approach so others can follow it in an emergency. Some ideas:

  • Use a wiki to write down your overall strategy and solicit feedback from other team members when you are thinking about what to monitor in the first place. You might start with an outline of the key business tasks behind a web site, then validate that with the product manager and other business people, then get very specific instructions from the tech team about what to monitor. Then, follow up and write down exactly what you are monitoring and how to tell what the alerts you set up mean. Leave a trail of documentation.
  • Incorporate this documentation into your monitoring by customizing the alert emails. We tend to want to keep those emails short so they can be read on a pager, but consider including a link back to your documentation or even setting up a separate alert that is triggered by a sequence of “correlated events.” This type of alert could be turned on ONLY when you are not available (e.g. on vacation) and configured to go to a wider audience.
  • Don’t overdo it. The problem with setting up elaborate monitoring schemes and alerting too many people is the risk that too much information will mean people ignore it or assume someone else is going to do something about it. So use your special alert emails sparingly and disable them when you are around.
  • Consider configuring some alerts to go directly to status pages. Heroix Longitude allows you to publish reports directly to a portal. People will likely ignore these things during “good times,” but think about how your less experienced colleagues might have to react in an emergency. Can you set up alerts to send emails with links to your best efforts at “forensics”? When you are first setting things up, you may do this for your own use, but then you’ll tend to prune things down to filter out the noise. Save your original work and re-activate it in situations where you are asking others to help cover for you.
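
As a rough illustration of the alert ideas above, here is a minimal Python sketch of a short alert body that always ends with a link to your documentation, plus a recipient list that widens only while you are away. Everything named here is an assumption made for the example: the `Alert` fields, the wiki URL, the addresses, and the `on_vacation` flag are hypothetical and are not part of Heroix Longitude or any other product's API.

```python
from dataclasses import dataclass

# Hypothetical addresses and wiki location, used only for illustration.
PRIMARY = ["admin@example.com"]
COVERAGE_TEAM = ["dba@example.com", "netops@example.com"]
WIKI_BASE = "https://wiki.example.com/monitoring/"


@dataclass
class Alert:
    check: str    # short name of the test that fired, e.g. "checkout-login"
    summary: str  # one-line description, short enough to read on a pager


def build_message(alert: Alert) -> str:
    """Keep the body short, but always end with a link to the documentation."""
    return f"{alert.summary}\nWhat this means and what to do: {WIKI_BASE}{alert.check}"


def recipients(on_vacation: bool) -> list[str]:
    """Widen the audience only while the primary admin is away."""
    return PRIMARY + COVERAGE_TEAM if on_vacation else list(PRIMARY)


if __name__ == "__main__":
    alert = Alert(check="checkout-login", summary="Login test failed 3 times")
    print(recipients(on_vacation=True))
    print(build_message(alert))
```

Keeping the audience switch in one place makes it harder to forget to turn the wider distribution off again when you are back.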

The key to making all of this work is to do it from the beginning, not as your last task before leaving for two weeks. The reality is that your elaborate coverage plan might break down immediately if it has not been used in practice. And if you are always the one who jumps in and solves the problems…no one else is ever going to really learn how to use the systems you’ve created. It’s just like a backup or disaster recovery plan–unless you actually use it and practice it, it’s just a bunch of ideas on paper. So take the time not only to plan well, but to bring other people into your coverage plan from the beginning. Examples:

  • Notify the DBA of any database-related alerts. But don’t just cc that person on every message that goes out. Send one alert that links to your page and reports, then, during a post-mortem, go over what the email meant and validate that the DBA would have been able to do something useful if you had not been around.
  • Don’t automate manager alerts. I recall one software product we used for monitoring that had different templates like “email to Boss.” The template was something like “We are busy addressing the problem, don’t worry.” That is a recipe for trouble. Don’t alert people unless you want them to take action. You can use reports and documentation with links to generated history and status to inform–but don’t bypass the people you want to take action. Instead, send an email alert with the links I described earlier; the person on call can then forward that email on to a manager with a legitimate/authentic message like “I received this alert and am investigating; see links below for background.”
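
One way to read “send one alert” rather than copying someone on every message is a simple per-category throttle. The Python sketch below is only an assumption about how that could work: the `OncePerIncident` class, the category names, and the 30-minute quiet period are hypothetical, and a real monitoring product would likely offer its own suppression settings.

```python
from datetime import datetime, timedelta


class OncePerIncident:
    """Let the first alert for a category through, then stay quiet for a while."""

    def __init__(self, quiet_period: timedelta):
        self.quiet_period = quiet_period
        self._last_sent: dict[str, datetime] = {}

    def should_send(self, category: str, now: datetime) -> bool:
        last = self._last_sent.get(category)
        if last is not None and now - last < self.quiet_period:
            return False  # this category was already notified recently
        self._last_sent[category] = now
        return True


if __name__ == "__main__":
    throttle = OncePerIncident(timedelta(minutes=30))  # assumed quiet period
    start = datetime(2009, 6, 24, 9, 0)
    for minutes in (0, 5, 45):
        when = start + timedelta(minutes=minutes)
        print(minutes, throttle.should_send("database", when))
```

The quiet period is measured from the last alert that was actually sent, so a long-running problem produces a periodic reminder instead of a flood.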

The title of this post sounded a bit ominous: “when you are Gone” means, hopefully, “on vacation.” But it’s a good survival strategy in general to always be thinking about creating effective shared responsibility. It’s a mistake to think indispensability equals job security, because when the company hits a business crisis, everyone is replaceable. You will find yourself becoming a DBA because that role was cut, or your boss will suddenly have to become an expert on systems administration. The best security is to stay ahead of problems and make everyone indispensable.
