Are you passionate about making your Production Support team better? Join me as we explore topics in Production Support of Mission Critical applications.
Friday, September 13, 2013
The Info You Need When You Need it Most: Runbooks
For this post, I'm going to continue to focus on the knowledge aspect of an application. In particular, I'll talk about Runbooks.
Runbooks should be the first point of reference for anything related to an application, and every application you support should have one. Otherwise, it would be like flying an airplane without a manual (for those who didn't catch the reference, pilots work from the airplane's manual when starting it, no matter how familiar they are with the model).
Runbooks should contain some key information about an application.
The most important section a runbook should contain is a Business Context section, which gives readers some idea of the business processes, their criticality and their potential financial impact. Most runbooks I've seen don't contain this section, but I like to have it in place. It should help drive home to a group of techies that they don't support some technology or application, but a business.
Runbooks should inform the analyst about the Architecture of an application. The Architecture section should provide an overview of the servers and the databases they communicate with, along with a network context for the application. It should also depict any middleware being used and give an idea of other upstream and downstream dependencies.
Another key section for the Runbook is an Administration section. This section should provide the user information about things like how to restart processes, scheduled jobs, breakglass procedures and start/end of day checks.
Likely, the most critical section in a runbook, when it comes to incidents, is a Monitoring and Alerting section. This section of the runbook should provide a list of common alerts and how to resolve them. This section might also contain information about the eyes-on-glass procedures for monitoring the application.
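To make the "common alerts and how to resolve them" idea concrete, here is a minimal sketch of how such a list could be kept in a structured form that an analyst (or a script) can look up during an incident. The alert names, descriptions and resolution steps are hypothetical and invented purely for illustration; a real runbook would hold your application's own alerts.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AlertEntry:
    """One row of a runbook's Monitoring and Alerting section."""
    name: str
    meaning: str
    resolution_steps: List[str] = field(default_factory=list)


# Hypothetical alerts, made up for this example only.
ALERT_CATALOG: Dict[str, AlertEntry] = {
    entry.name: entry
    for entry in [
        AlertEntry(
            name="DISK_USAGE_HIGH",
            meaning="The log volume has crossed its usage threshold.",
            resolution_steps=[
                "Confirm which directory is growing.",
                "Archive or purge old logs per the retention policy.",
            ],
        ),
        AlertEntry(
            name="FEED_STALE",
            meaning="No message has arrived from the upstream feed recently.",
            resolution_steps=[
                "Check that the feed handler process is running.",
                "Contact the upstream team listed in the Escalation section.",
            ],
        ),
    ]
}


def lookup_alert(name: str) -> str:
    """Return the runbook text for an alert, or a fallback for unknown alerts."""
    entry = ALERT_CATALOG.get(name)
    if entry is None:
        return f"{name}: not in runbook; escalate per the Escalation section."
    lines = [f"{entry.name}: {entry.meaning}"]
    lines += [f"  {i}. {step}" for i, step in enumerate(entry.resolution_steps, start=1)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(lookup_alert("DISK_USAGE_HIGH"))
    print(lookup_alert("SOMETHING_NEW"))
```

The fallback branch matters as much as the catalog itself: an alert that isn't documented yet is a signal that the runbook needs a new entry.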
Next in criticality after the Monitoring and Alerting section is the Escalation section. The contact details for Development and Key Business users should be documented there. Contact information for key Infrastructure teams and Upstream/Downstream teams should be captured as well.
A section which provides more detail about how the application works would be an Application Deployment section. This section should contain information like which locations an application is deployed in and what dependencies it has.
The Monitoring and Alerting section should be supplemented with a Troubleshooting section which captures the most common issues, known bugs and limitations.
A Tools section listing the common tools the team uses for troubleshooting is a good thing to document as well. New team members would certainly appreciate a handy list of the tools their teammates use, ideally with links for downloading and installing them.
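The sections described above can double as a completeness checklist. The snippet below is a small sketch, assuming a runbook is represented as a plain dictionary of section name to text; that representation and the function name are my own assumptions for the example, not a standard format.

```python
from typing import Dict, List

# Section names taken from this post, in the order they were introduced.
REQUIRED_SECTIONS: List[str] = [
    "Business Context",
    "Architecture",
    "Administration",
    "Monitoring and Alerting",
    "Escalation",
    "Application Deployment",
    "Troubleshooting",
    "Tools",
]


def missing_sections(runbook: Dict[str, str]) -> List[str]:
    """Return required sections that are absent or blank, in checklist order."""
    return [
        section
        for section in REQUIRED_SECTIONS
        if not runbook.get(section, "").strip()
    ]


if __name__ == "__main__":
    # Hypothetical draft runbook: one real section, one left blank.
    draft = {
        "Architecture": "Two app servers behind a load balancer, one database.",
        "Escalation": "   ",
    }
    for section in missing_sections(draft):
        print(f"Missing or empty: {section}")
```

Running a check like this across every application you support gives a quick view of which runbooks still lack, say, a Business Context section.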
A final word about Runbooks. Do you want to assess your team's proficiency when it comes to application knowledge? Make a bulleted list with each section of your runbook. Pick some topics from each section and make a little quiz. You'll now have a quick and dirty way to find out their proficiency level.
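If you'd rather not hand-pick the quiz topics, the same idea can be sketched in a few lines of Python. This assumes you have already written down a list of topics per section; the section contents and the `build_quiz` helper below are hypothetical examples, not part of any tool.

```python
import random
from typing import Dict, List, Optional


def build_quiz(
    topics_by_section: Dict[str, List[str]],
    per_section: int = 1,
    seed: Optional[int] = None,
) -> List[str]:
    """Pick up to `per_section` random topics from each section and phrase them as prompts."""
    rng = random.Random(seed)  # a fixed seed gives the same quiz every time
    quiz: List[str] = []
    for section, topics in topics_by_section.items():
        picked = rng.sample(topics, k=min(per_section, len(topics)))
        quiz.extend(f"[{section}] Explain: {topic}" for topic in picked)
    return quiz


if __name__ == "__main__":
    # Hypothetical topics; replace with items from your own runbook.
    topics = {
        "Business Context": ["Which business process fails if the app is down?"],
        "Administration": ["How to restart the main process", "Start of day checks"],
        "Escalation": ["Who is the Development contact?", "Who owns the upstream feed?"],
    }
    for question in build_quiz(topics, per_section=1):
        print(question)
```

Drawing at least one question from every section keeps the quiz from testing only the parts of the runbook people already know well.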
Very nice write-up on runbooks. Documentation is an area that usually is lacking within IT organizations. What is your opinion on Application Aware Runbook Automation and how it fits into what you've written above? Here is a link for your reference... http://www.appdynamics.com/blog/2013/03/14/application-runbook-automation-detailed-walk-through/
Full Disclosure: I work for AppDynamics but I am genuinely interested in your opinion.
I like the feature. I've seen this feature before in other tools, but there are a couple of things I really liked:
1) That AppDynamics has a way of increasing log output built into it.
2) That the tool is quite aware of KPI information.
This is where other monitoring tools come up a little short. For example, ITRS has a feature where it runs an "Action Script." It's similar to the RBA feature of AppDynamics, but ITRS doesn't capture KPI information as cleanly (where you can see the effect of the remediation script in real time). This is important information to know. In the example you guys provide, where you increase connection pool sizes, you saw that throughput had gone up. But increasing connection pool sizes can have adverse effects. What if you'd increased it to 2000 instead of 35 (albeit a contrived scenario)? You might have seen a huge memory problem and perhaps even less throughput. So, knowing the effect is critical. Good stuff.
Nice... I agree with your comment on including some business context. It will help the analyst do a quick and meaningful impact assessment.
Overuse or underuse of a runbook can also point to different areas to focus on: it could be application stability, whether the runbook is being kept updated, the quality of the runbook, etc.