If you've been in Production Support for any length of time, you've had one of those weeks where everything seems to fall apart all at once, and you get woken up in the middle of the night to troubleshoot what seem to be never-ending issues. It happens to all of us, including companies like Amazon.com, Outlook.com and, yes, even Google.com (check this link to see what I mean).
Outages will happen, and dealing with them effectively requires a strong Major Incident Management discipline. But the work isn't done once service is restored. It's time for Problem Management!
Problem Management is the discipline that helps prevent outages from happening again (and which will allow you to get on the road towards higher system availability). There are 3 key components to Problem Management:
- Root-Cause Identification
- Identification of Follow-up Actions
- Tracking Follow-up Actions to completion
Every single incident has a root-cause. I repeat, every single incident has a root-cause. You might have a difficult time determining the root-cause, but that doesn't mean there isn't one. And in most cases, the issue will recur if steps aren't taken to correct the situation.
I've seen many a Production Support analyst do this: every time an issue shows up (especially minor, recurring ones), they manually intervene, correct the problem and sit back. There is absolutely no value in doing this! The issue will come back, and each time it does there is a risk of impacting users. Not to mention that it's just a bunch of needless work. I don't understand the rationale behind this approach, but if the idea is to make yourself indispensable, it doesn't ensure job security either. If anything, it's a good way to get walked out the door the day stuff hits the fan and your manager finds out you've known about the problem for months. I once found a situation where a manual process had been carried out for 12 years to handle a particular situation!
For every single root-cause, there are one or more actions that can be taken to prevent the issue from happening again. Each of these follow-up actions needs to be tracked to completion. At most places where I've worked, each of these is put into a Problem Ticket. The follow-up actions might involve the intervention of other teams, for example, Development groups. So it's critical that the Problem Management process is adopted by the entire organization and that all stakeholders are ready to work on the tickets that get assigned to them.
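To make the idea of tracking follow-up actions to completion a bit more concrete, here is a minimal sketch in Python. It is not taken from any particular ticketing tool: the `ProblemTicket` fields, the `outstanding` and `incident_resolved` helpers, and the ticket and incident IDs are all made up for illustration. It assumes one Problem Ticket per follow-up action, each carrying the root-cause it addresses and the team that owns it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProblemTicket:
    """One follow-up action (hypothetical fields, not a real tool's schema)."""
    ticket_id: str
    root_cause: str
    action: str
    owner_team: str
    incident_id: Optional[str] = None  # None if the ticket did not come from an incident
    done: bool = False


def outstanding(tickets: List[ProblemTicket]) -> List[ProblemTicket]:
    """Return the follow-up actions that have not been completed yet."""
    return [t for t in tickets if not t.done]


def incident_resolved(tickets: List[ProblemTicket], incident_id: str) -> bool:
    """An incident counts as fully addressed only when it has at least one
    follow-up action and every one of them is done."""
    related = [t for t in tickets if t.incident_id == incident_id]
    return bool(related) and all(t.done for t in related)


# Example data, invented for illustration.
tickets = [
    ProblemTicket("PRB-1", "Log rotation was never configured",
                  "Configure log rotation", "Production Support",
                  incident_id="INC-1001", done=True),
    ProblemTicket("PRB-2", "Log rotation was never configured",
                  "Add a disk usage alert", "Monitoring",
                  incident_id="INC-1001"),
]

for t in outstanding(tickets):
    print(f"{t.ticket_id} ({t.owner_team}): {t.action}")
```

The point of `incident_resolved` is the same as the point of the process: restoring service doesn't close the loop, completing every follow-up action does.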
Follow-up actions are typically discussed, created and assigned during a post-incident review. This can be an informal process if the incident is small, but it can be a very formal, high-profile meeting for high-severity, critical outages. I'll write more about how to run an effective post-incident review meeting in a later post.
There will be organizations where not all stakeholders are on board with being assigned tickets and tracking them to completion with the right level of urgency. In these cases, I've had success by getting a few key people from those groups to attend a recurring meeting. I've found weekly meetings to be most effective, but meeting every two weeks or even monthly might do the trick. During the meeting, the Problem Tickets are presented with a focus on the highest priority ones. The general idea is to determine what can be done to implement the follow-up action and when it can be prioritized to be worked on. This is a balancing act, of course. But being relentless, and doing the due diligence of organizing the tickets and presenting them, inevitably lends credibility to the process. External teams are more willing to work on implementing the solutions if they see you're committed as well. I was once at an organization where we had about 300 follow-up action tickets when I first started there. In about a year, that number had dropped to about 20 outstanding items, including new problem tickets that were being opened.
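Organizing the tickets for that recurring meeting can be as simple as sorting the open ones. The sketch below is one possible way to do it, not a prescribed method: the `AgendaItem` fields, the priority scale (1 = highest), the tie-break on age and the `limit` parameter are all assumptions of mine.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class AgendaItem:
    """A Problem Ticket as presented in the review meeting (hypothetical fields)."""
    ticket_id: str
    priority: int      # assumed scale: 1 = highest priority
    opened: date
    owner_team: str
    done: bool = False


def weekly_agenda(items: List[AgendaItem], limit: int = 10) -> List[AgendaItem]:
    """Open tickets only, highest priority first, oldest first within a priority."""
    open_items = [i for i in items if not i.done]
    return sorted(open_items, key=lambda i: (i.priority, i.opened))[:limit]


# Example data, invented for illustration.
backlog = [
    AgendaItem("PRB-7", 2, date(2024, 3, 1), "Development"),
    AgendaItem("PRB-3", 1, date(2024, 2, 10), "Database"),
    AgendaItem("PRB-9", 1, date(2024, 1, 5), "Development"),
    AgendaItem("PRB-4", 1, date(2023, 12, 1), "Network", done=True),
]

for item in weekly_agenda(backlog):
    print(item.ticket_id, item.priority, item.opened, item.owner_team)
```

Capping the list with `limit` keeps the meeting focused on the highest priority items instead of walking through the whole backlog every time.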
One more thing about Problem Tickets. Incidents shouldn't be the only source for creating Problem Tickets. Recurring issues, enhancements, and automation opportunities can also be tracked in Problem Tickets. They might be prioritized differently, of course, but the general idea is the same.
So start working on your Problem Management discipline. You'll have great system availability to show for it - not to mention, you'll be able to sleep, finally.