Wednesday, August 7, 2013

Mayhem is Everywhere: Incident Management

Do you hear that sound? That's the sound of your application coming to a screeching halt because the "new guy" (should we call him Mayhem?) just decided to install a new test server. It just so happens that the server is configured with a duplicate IP address, bringing an entire subnet (and about 40 servers) down. *Any similarities to real events are not intended.


So what now? No insurance company (not even Allstate) can protect you from this type of mayhem, so you're on your own, Prod Support guy! But you're prepared, of course, because you worked very diligently to define an Incident Management Process.

What does a good Incident Management Process look like? A good one handles two key pieces: Service Restoration and Communications. The former is likely more intuitive than the latter, but both are critical. Communications give your stakeholders confidence that service is being restored, and some insight into why it might be difficult to recover from certain situations.

When an incident is first detected, the very first thing that Production Support groups must do is acknowledge that there is an incident and begin communicating with stakeholders: business partners, technology partners and, if necessary, external clients. An initial acknowledgement (e-mail) should be relatively easy to generate, as it doesn't require much detail. It's basically just a note that gives stakeholders confidence that someone is looking into the issue with the appropriate level of urgency. Another important part of this initial step is to open a ticket (or add some kind of entry) in a tracking system, so that metrics about the incident can be kept. A good practice is to put the ticket ID in the initial acknowledgement e-mail and in all follow-up communications.
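As a rough illustration of that first step, here is a minimal Python sketch that builds an acknowledgement carrying the ticket ID in both the subject and the body. The `Incident` class, the `build_acknowledgement` function and the `INC-1001` ticket format are all hypothetical names made up for this example; a real shop would pull the ticket ID from its own tracking system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Incident:
    ticket_id: str       # ID issued by the tracking system (format is an assumption)
    summary: str         # one-line description of the impact
    opened_at: datetime  # when the incident was first detected


def build_acknowledgement(incident: Incident) -> dict:
    """Return the subject and body of an initial acknowledgement e-mail."""
    subject = f"[{incident.ticket_id}] Incident acknowledged: {incident.summary}"
    body = "\n".join([
        f"Ticket: {incident.ticket_id}",
        f"Opened: {incident.opened_at:%Y-%m-%d %H:%M} UTC",
        f"Impact: {incident.summary}",
        "Status: Production Support is investigating. Updates will follow.",
    ])
    return {"subject": subject, "body": body}


incident = Incident(
    ticket_id="INC-1001",
    summary="Servers on one subnet unreachable",
    opened_at=datetime(2013, 8, 7, 14, 5, tzinfo=timezone.utc),
)
ack = build_acknowledgement(incident)
print(ack["subject"])
print(ack["body"])
```

Putting the ticket ID first in the subject line means every later message about the same incident can be found with a single search.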

Once the acknowledgement is out the door (or in parallel), incident triage and troubleshooting begin. In most shops where I've worked, this is done via an incident conference call (or bridge). In many cases, the bridge convenes even before the acknowledgement has been sent. At this stage the severity of the incident is assessed and the right resources are called in to help resolve the situation. One useful thing to do is to add all participants to a group chat so that it's easier to share logs, server names and other information that might be difficult to communicate by phone.

Stakeholders should receive regular updates on the results of the investigation and on the resolution steps being taken. I recommend sending these every 15 minutes; however, I've seen some shops that send them on an as-needed or even an hourly basis.
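To make the cadence concrete, the sketch below lists when the next few updates would be due after the acknowledgement goes out. The function name `update_times` and its parameters are assumptions for illustration; the interval defaults to the 15 minutes recommended above and can be set to 60 for an hourly cadence.

```python
from datetime import datetime, timedelta


def update_times(start: datetime, interval_minutes: int = 15, count: int = 4) -> list:
    """Return the times at which the next `count` stakeholder updates are due."""
    step = timedelta(minutes=interval_minutes)
    return [start + step * i for i in range(1, count + 1)]


acknowledged_at = datetime(2013, 8, 7, 14, 5)
for due in update_times(acknowledged_at):
    print(due.strftime("%H:%M"))
```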

E-mail updates should follow a standard template. When designing the template, keep in mind that many users read the updates on handheld devices, which might not be able to render complicated tables or graphics-heavy HTML.
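One way to keep updates readable on small screens is a short plain-text template with one labelled field per line. The sketch below uses Python's standard `string.Template`; the field names (`number`, `ticket_id`, `status`, `impact`, `next_update`) and the sample values are assumptions, not a prescribed format.

```python
from string import Template

# Plain text only: no tables, no HTML, one labelled field per line.
UPDATE_TEMPLATE = Template(
    "INCIDENT UPDATE #$number\n"
    "Ticket: $ticket_id\n"
    "Status: $status\n"
    "Impact: $impact\n"
    "Next update: $next_update\n"
)


def render_update(**fields: str) -> str:
    """Fill in the template; raises KeyError if a field is missing."""
    return UPDATE_TEMPLATE.substitute(**fields)


message = render_update(
    number="2",
    ticket_id="INC-1001",
    status="Investigating",
    impact="Servers on one subnet unreachable",
    next_update="14:35",
)
print(message)
```

Because `substitute` fails on a missing field, an incomplete update cannot be rendered by accident.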

Once the issue is resolved, a clear "Service Restored" notice should go out to the stakeholders to let them know that the system is available to them once again.

We're not done yet! Once the service is restored, an assessment of availability should be performed to determine how much the incident has made us deviate from the 100% availability mark. Also, a root-cause investigation should be performed and the whole incident should be summarized in a formal Post-Mortem or Executive Summary. I'll cover more on calculating availability impact and on executive summaries in another post.
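As a quick preview of the arithmetic, one common and simple way to express the impact is availability as the percentage of a measurement period during which the service was up. This is an assumed definition for illustration, not necessarily the one a given shop or SLA uses, and the 30-day period and 45-minute outage below are made-up numbers.

```python
def availability_percent(period_minutes: float, downtime_minutes: float) -> float:
    """Percentage of the measurement period during which the service was up."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    if not 0 <= downtime_minutes <= period_minutes:
        raise ValueError("downtime_minutes must be between 0 and period_minutes")
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


minutes_in_30_days = 30 * 24 * 60  # 43,200 minutes
pct = availability_percent(minutes_in_30_days, downtime_minutes=45)
print(f"{pct:.3f}%")  # prints 99.896%
```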

At this point, the Problem Management process begins.

2 comments:

  1. A very good blog post on the nuances of Production Support! It explains well and truly how communication during a production incident is as important as the resolution itself.
    Even if every support team is busy finding the root cause and resolving the issue, that effort counts for little without communication to the stakeholders involved. The users and stakeholders of the system do look forward to a resolution, but they care even more about knowing that the issue is being looked at and, if the fix is delayed, whether any other strategic decisions can be taken to mitigate the risks.

  2. Communication really plays a vital role in the Incident Management Process. Assuring business partners that the issue they reported is being looked into helps to gain their trust and puts them at ease. Regular updates to the stakeholders keep everyone assured and aware that the issue is being handled with maximum priority. In the meantime, partners may look for a workaround to mitigate the loss.
