Visit Heroix at http://www.heroix.com
Charting Life in the IT Environment

>> Plan your next IT Crisis

by Dave Atkins on June 12, 2009

One of my favorite ironies was listening to my boss say, with a straight face, that we needed a report on the likely unexpected problems for the next week. You know you are in trouble when people ask for the “what could go wrong?” report and you need to develop a plan for what to do when the unpredictable happens.

Our podcast describes how the Heroix Longitude product can automate monitoring of network, system, application, security, and other event logs. Splunk, the product I blogged about earlier this week, is a great tool for discovery. You will also need a toolkit of “system snapshotting” tools. And don’t forget to reach out to other professionals and build a network of real, live people you can turn to in a crisis.

Then make a plan. People often refer to the chaos in unmanaged IT operations environments as a series of “fire drills,” but that is not really what it is. In a real fire drill, an alarm goes off and everyone gets up and proceeds in an orderly manner down the stairs to a designated meeting place to wait for the fire truck to show up and declare the building safe. People don’t stand around pointing fingers at each other and trying to find someone to blame; they just wait for the professional firefighters to show up.

When an IT problem sets off an alarm, you know you could not have predicted it; otherwise you would have prevented it! But you can develop a three-pronged approach to handling the crisis so that you and your team are free to solve the problem and not distracted by chaos, confusion, speculation, and blame.

  • Implement a monitoring and alerting solution that gives you some kind of “sunny day” reporting. Our product Longitude can give you the “single pane of glass” that consolidates many servers in the enterprise and rolls up event log reporting in a manageable interface. You can also use a variety of other commercial and open source tools–but invest some time in learning how to generate meaningful reporting so you have some context for when a problem happens.
  • Prepare your “red alert!” tools. Talk to developers and systems engineers to find out which tools they use for performance analysis and debugging, then write a plan of attack for gathering useful data for developers from the operations environment in a crisis. For example, you might want to install JAMon (Java Application Monitor) on a Tomcat server. Document what to type to quickly gather basic system data using iostat, vmstat, netstat, and top on Linux, or Perfmon on Windows. Your eyes may quickly glaze over with the complexity, so work out a one-page cookbook of what anyone on the team can do to gather data while the servers are in a crisis state.
  • Have a way out. No, this does not mean ensuring that your LinkedIn profile is up to date. Plan what you can do to restore some degree of functionality and service so people know you are working on the problem. In a web environment that might mean deploying a (shudder!) “site down for maintenance” page. Or maybe that is unacceptable…then you need to find out what is acceptable. While you are trying to deal with a crisis, time runs faster for you than for the customers…what was going to take “a minute” takes half an hour. So you need a way to buy yourself time and avoid people looking over your shoulder and demanding status updates.
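
The one-page cookbook from the second point can be backed by a small script, so nobody has to remember flags under pressure. What follows is only a rough sketch in Python, not something we ship. The names take_snapshot and SNAPSHOT_COMMANDS, the "snapshots" output directory, and the particular flags are assumptions chosen for illustration; the flags are the common Linux forms, so check them against the versions on your own servers.

```python
import datetime
import pathlib
import shutil
import subprocess

# Hypothetical command list for a Linux host. The flags are common defaults,
# not a recommendation; adjust them to match your distribution.
SNAPSHOT_COMMANDS = {
    "vmstat": ["vmstat", "1", "5"],         # five samples, one second apart
    "iostat": ["iostat", "-x", "1", "3"],   # extended disk statistics
    "netstat": ["netstat", "-an"],          # all sockets, numeric addresses
    "top": ["top", "-b", "-n", "1"],        # one batch-mode iteration
}


def take_snapshot(commands, base_dir="snapshots", timeout=30):
    """Run each command and save its output under a timestamped directory.

    Returns the output directory and a dict mapping each command name to a
    short status string, so the caller can see what was actually captured.
    """
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    out_dir = pathlib.Path(base_dir) / stamp
    out_dir.mkdir(parents=True, exist_ok=True)

    results = {}
    for name, argv in commands.items():
        # Skip tools that are not installed instead of aborting the snapshot.
        if shutil.which(argv[0]) is None:
            results[name] = "skipped: command not found"
            continue
        try:
            proc = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            results[name] = "timed out"
            continue
        (out_dir / f"{name}.txt").write_text(proc.stdout + proc.stderr)
        results[name] = f"exit {proc.returncode}"
    return out_dir, results
```

Calling take_snapshot(SNAPSHOT_COMMANDS) during an incident leaves one text file per tool in a directory named for the time of the snapshot, which you can hand to developers afterwards. A missing tool is reported as skipped and does not stop the run, which matters when the server you are on is not the one you prepared for.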

This plan is your real fire drill. It’s a way to turn the “what is going to go wrong next” conversation into “what we are going to do next.”

© 2010 Heroix | Heroix | RSS | Privacy Policy | Email: info@heroix.com