
Saturday, August 31, 2013

Get Some Sleep: Problem Management

If you've been in Production Support for any length of time, you've had one of those weeks: everything seems to fall apart at once, and you get woken up in the middle of the night to troubleshoot what seem to be never-ending issues. It happens to all of us, including companies like Amazon.com, Outlook.com and, yes, even Google.com (check this link to see what I mean).


Outages will happen, and dealing with them effectively requires a strong Major Incident Management discipline. But the work isn't done once service is restored. It's time for Problem Management!

Problem Management is the discipline that helps prevent outages from happening again (and puts you on the road towards higher system availability). There are 3 key components to Problem Management:
  • Root-Cause Identification
  • Identification of Follow-up Actions
  • Tracking Follow-up Actions to completion

Every single incident has a root-cause. I repeat, every single incident has a root-cause. You might have a difficult time determining the root-cause, but that doesn't mean there isn't one. And in most cases, the issue will recur if steps aren't taken to correct the situation.

I've seen many a Production Support analyst do this: every time an issue shows up (especially a minor, recurring one), they manually intervene, correct the problem and sit back. There is absolutely no value in doing this! It means the issue will come back, and there is a risk of impacting users. Not to mention that it's just a bunch of needless work. I don't understand the rationale behind this approach, but in case people think being depended on protects them, it doesn't ensure job security either. If anything, it's a good way to get walked out the door the day stuff hits the fan and your manager finds out you've known about the problem for months. I once found a situation where a manual process had been performed, for 12 years, to handle a particular recurring issue!

For every single root-cause, there are one or more actions that can be taken to prevent the issue from happening again. Each of these follow-up actions needs to be tracked to completion. At most places where I've worked, each of these is put into a Problem Ticket. The follow-up actions might involve the intervention of other teams, for example, Development groups. So it's critical that the Problem Management process is adopted by the entire organization and that all stakeholders are ready to work on the tickets that get assigned to them.
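
To make this concrete, here's a minimal sketch of how a Problem Ticket and its follow-up actions might be modeled. The field names and statuses are my own illustrative assumptions, not the schema of any particular ticketing tool.

    # A minimal sketch of a Problem Ticket record. Field names and statuses are
    # illustrative assumptions, not any specific ticketing tool's schema.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class FollowUpAction:
        description: str
        assigned_team: str          # e.g. a Development group outside Prod Support
        status: str = "Open"        # Open -> In Progress -> Done

    @dataclass
    class ProblemTicket:
        ticket_id: str
        root_cause: str             # every incident has one, even if it's hard to find
        source_incident: str        # the incident (or recurring issue) that triggered it
        actions: List[FollowUpAction] = field(default_factory=list)

        def is_resolved(self) -> bool:
            # The ticket is only done when every follow-up action is completed.
            return all(a.status == "Done" for a in self.actions)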

Follow-up actions are typically discussed, created and assigned during a post-incident review. This can be an informal process if the incident is small, but it can be a very formal, high-profile meeting for high severity, critical outages. I'll write more about how to run an effective post-incident review meeting in a later post.

There will be organizations where not all stakeholders are on board with being assigned tickets and tracking them to completion with the right level of urgency. In these cases, I've had success by getting a few key people from those groups to attend a recurring meeting. I've found weekly meetings to be most effective, but every two weeks or even monthly might do the trick. During the meeting, the Problem Tickets are presented, with a focus on the highest-priority ones. The general idea is to determine what can be done to implement each follow-up action and when it can be prioritized to be worked on. This is a balancing act, of course. But being relentless, and doing the due diligence of organizing and presenting the tickets, inevitably lends credibility to the process. External teams are more willing to work on implementing the solutions if they see you're committed as well. I was once at an organization that had about 300 follow-up action tickets when I first started. In about a year, that number had dropped to about 20 outstanding items, including new problem tickets that were being opened.

One more thing about Problem Tickets. Incidents shouldn't be the only source for creating Problem Tickets. Recurring issues, enhancements, and automation opportunities can also be tracked in Problem Tickets. They might be prioritized differently, of course, but the general idea is the same.

So start working on your Problem Management discipline. You'll have great system availability to show for it - not to mention, you'll be able to sleep, finally.

Thursday, August 29, 2013

Who you gonna call? CRM for Production Support Teams

Many Production Support teams don't consider Customer Relationship Management (CRM) an important part of what they do. In my opinion, there are very few things that matter more than knowing your customers and their needs intimately. I'll provide a few examples that will lend insight into why I feel this way. Before I do that, let me explain that I use the word customer interchangeably with stakeholder. Thus, customers can be internal business partners, external business partners, vendors, other internal teams, etc.


First of all, there are basic needs that all Prod Support teams have in terms of knowing which teams to contact when resolving an issue goes beyond the Prod Support team's scope. For example, knowing how to contact a Network or DBA team can be critical. Thus Support teams need to have a well-known, easy-to-navigate Escalation Database. Most organizations I've been in utilize intranet Wikis or document management sites like SharePoint to keep an escalation list. It's critical that these DBs be kept up to date and that they contain key information like:
  1. The documented team name: I can't stress this one enough. Many large companies have several teams that handle networks, and knowing which one to call is critical. For example, the team that handles routing might not be the same team that handles load balancing, yet both might fall under the network umbrella.
  2. The escalation rotation with the hours that the rotation encompasses
  3. The contact telephone number and/or e-mail
Each person on the support team must know exactly where to find the escalation database, and accessing this information should be easy. When you get a new person on your Support team, there are few things more important than telling them exactly whom to call.

A few more thoughts on the Escalation DB: it should also contain the names of all the support staff and managers. There should also be a process to maintain it (most teams will update the information organically; however, giving the Escalation DB a quarterly or semi-quarterly review is always healthy).
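
To illustrate, an escalation entry can be as simple as a structured record like the one below; the team names, hours, contact details and review date are all made up.

    # Illustrative escalation database entry; team names, hours and numbers are invented.
    escalation_db = {
        "Network - Load Balancing": {                 # the documented team name matters
            "rotation": [
                {"hours": "08:00-20:00 ET", "contact": "+1-555-0100", "email": "lb-oncall@example.com"},
                {"hours": "20:00-08:00 ET", "contact": "+1-555-0101", "email": "lb-oncall@example.com"},
            ],
            "staff": ["A. Analyst", "B. Engineer"],
            "manager": "C. Manager",
            "last_reviewed": "2013-07-01",            # supports the quarterly review
        },
    }

    def who_to_call(team_name: str) -> dict:
        """Look up the escalation entry; fail loudly if the team isn't documented."""
        return escalation_db[team_name]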

More advanced CRM practices can provide Support teams with valuable information about clients, especially external customers. Utilizing CRM tools like Salesforce.com or Microsoft Dynamics allows teams to track various pieces of information about customers. For example, different customers might connect to a system to perform transactions at different times. It can also be useful to know which products a customer uses and which they don't.

The true power of CRM systems (which can provide competitive advantage) comes when they're accessible to both business users and support staff. Well-documented and adopted processes to gather information about a customer and their transacting patterns allow teams to better prepare for and support them. For example, when onboarding a new customer, a sales representative for a financial trading system might mention that a particular customer performs most transactions right before market close (let's say between 3:00 PM and 4:00 PM). The Production Support team might then add monitoring to keep an extra eye on this customer and their transactions during that window. The team might notice that the customer isn't connected and ready to do business and might want to preemptively call to make sure that everything's OK. In many financial trading systems, revenues come in the form of fees on transactions. If the customer can't do business, this could mean lost revenue.
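
Here's a minimal sketch of how that CRM knowledge could drive extra monitoring; the customer name, peak window and check are hypothetical.

    # Sketch of using CRM data to drive extra monitoring during a customer's peak window.
    # The customer, window and check are hypothetical examples.
    from datetime import datetime, time

    customer_profile = {
        "name": "Example Trading Co",
        "products": ["FX Spot"],
        "peak_window": (time(15, 0), time(16, 0)),   # transacts mostly before market close
    }

    def in_peak_window(profile: dict, now: datetime) -> bool:
        start, end = profile["peak_window"]
        return start <= now.time() <= end

    def check_customer(profile: dict, connected: bool, now: datetime) -> None:
        # If the customer isn't connected during their peak window, call them preemptively.
        if in_peak_window(profile, now) and not connected:
            print(f"ALERT: {profile['name']} not connected during peak window - call to confirm all is OK")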

Thursday, August 22, 2013

Time Track to Identify Hotspots

I haven't yet met a support analyst who likes to enter tickets into tracking systems. This practice, however, is a necessity for support teams. Consider it a necessary evil, if you will. In fact, all support staff need to understand that ticketing is just part of their responsibilities. That doesn't mean we shouldn't explain to analysts why ticketing is necessary. With this post, I plan to do just that.


In my mind, time tracking's biggest benefit is helping to identify support hotspots.

Time tracking should tell us things like:
  1. How much time are teams spending on incidents?
  2. How much time are teams spending on service requests?
  3. How much time are teams spending on change requests?
  4. Which applications utilize most of the support bandwidth?
  5. Which application components are the biggest offenders?
  6. Are there recurring issues which should be corrected?
  7. Are there recurring service requests which should be automated (e.g. reports)?

Deep diving into the time tracking data, to get the answers to those questions, will provide insight into which areas should be the focus of stability programs. Deep dive sessions, where tickets are reviewed for trends and correlated with availability numbers, should be scheduled to occur regularly (I suggest at least quarterly). Once the various hotspots are identified, problem tickets should be opened for follow-up actions (more on Problem Management in my next post).
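
As a sketch of what such a deep dive might look like, assuming your ticketing tool can export tickets with application, category, summary and effort-minutes columns (the file and column names are assumptions):

    # Sketch of a quarterly hotspot deep dive over exported ticket data.
    # Column names (application, category, summary, minutes) are assumptions about your export.
    import pandas as pd

    tickets = pd.read_csv("tickets_q3.csv")   # hypothetical export from the ticketing tool

    # Effort by application and ticket category (incident, service request, change request)
    effort = (tickets.groupby(["application", "category"])["minutes"]
                     .sum()
                     .sort_values(ascending=False))
    print(effort.head(10))        # the top rows are your hotspots

    # Recurring issues: the same summary showing up again and again is automation bait
    recurring = tickets["summary"].value_counts().head(10)
    print(recurring)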

Time tracking is also important for metrics and reporting purposes. In fact, time tracking is the basis for metrics and reporting. We can't report unless we're keeping track of issues and how they're affecting system availability. Important things to report on are:
  1. Availability metrics
  2. Top Talker incidents: These are the incidents which caused the most impact to availability
  3. Incident counts
  4. Service request counts
  5. Change request counts
  6. Support effort expended
Why do we even care about metrics and reporting? We care because we can't fix it if we don't know it's broken. Support teams can't work on "hunches" and need to focus their attention on true problems. If you can't measure it, you really can't fix it.

Finally, let's not be naive. Support is a cost center, and many senior managers are interested in making sure that support teams are staffed only with the necessary folks. Why spend the extra dollars if we can't justify the expense? So at least one point of time tracking is to justify staffing needs. For support managers, it also helps forecast future staffing needs as the application changes (which can mean an increase or decrease in staff).

Friday, August 16, 2013

Level with Me: The Levels of Production Support

I'd like to take a pause and answer a question: What are the levels of Production Support?


This is a question I see a lot in forums all over the Web: what do the different Support levels mean? Of course, this is already a loaded question because it assumes all companies have distinct levels of Prod Support. I can tell you, from experience, that this isn't always the case. Some companies' Support staff answer the phone, perform investigations on questions, and also perform break-fixes (which would be a combination of the typical levels of Support).

But the question that most people mean to ask is, "In a 3-Level Support Model, what does each level mean?" This question is good and relevant because many, many organizations structure their Support teams this way.

So without further ado, here are the 3 Levels of Production Support...and what I notice they typically mean in various organizations:
  1. Level 1: This is the most basic of Support levels. Staff performing this level of support generally perform eyes-on-glass monitoring. In the event of an alert, Level 1 staff rely on Standard Operating Procedures (SOPs) that they follow in order to correct a situation. The SOP might call for staff to run a script, restart a process, etc. Level 1 staff also field business requests and prioritize tickets. If Level 1 staff don't have an SOP to resolve the question or request, they must escalate to staff performing Level 2 Support. At this level, no in-depth troubleshooting is done; only troubleshooting that helps define the issue at hand is undertaken.
  2. Level 2: This level of Support generally involves more in-depth troubleshooting and requires that the staff understand the way a system works quite intimately. Level 2 Support includes finding resolution to incidents for which SOPs are not available and finding answers to questions which involve intimate knowledge of the system. This might include poring through log files to determine why something went wrong or why a system transaction behaved in an unexpected way. Level 2 Staff might utilize the application code to resolve an issue. They're also in charge of handling service requests that involve creating custom queries to automate reports. When Level 2 staff cannot handle a particular request, they must escalate to Level 3 Support.
  3. Level 3: Level 3 Support involves an understanding of an application deep enough that break-fixes can be done by staff at this level. Ideally, Support staff can build this level of knowledge themselves; however, many times Level 3 Support is done by the application Development team.

Thursday, August 15, 2013

Capacity Management Shouldn't Be An Afterthought

Capacity management, or planning, is the often forgotten process that can make a huge difference in terms of application availability. When applications first come online, both the hardware and the software are sized to handle the volumes of business at that point in time. However, transaction volumes never stay the same and typically grow at faster rates than anticipated. The reasons for this vary, but generally, new applications enable business efficiencies that help the business grow. Also, it's very hard to anticipate business growth and external factors such as regulatory changes. All these changes can increase the demand for resources in the system, and Production Support teams need to be aware of how this growth in volume can affect application performance.


Capacity Management attempts to avoid reactive approaches to sizing systems (usually at the point of impact) and instead looks to cost-effectively avoid system degradation due to volume increases.

At a basic level, Capacity Management programs need to track basic system Key Performance Indicators (KPIs) such as CPU, Memory, Disk, Swap and Network utilization. The real power of capacity planning comes when transaction volumes can be correlated with the individual system KPIs. This enables teams to answer critical questions such as: If volume increases by x%, how well will my system perform? If your capacity management program doesn't answer that question with a good degree of certainty, then you don't have much of a program at all.

Beyond the basic KPIs, Production Support teams need to identify system-specific metrics that can give them insight into the performance of the tools they manage. For example, I used to run a group in charge of supporting FX platforms. A key indicator for FX is how quickly a price can be generated once markets tick. So pricing latency was a critical KPI for that system.

Capacity metrics need to be tracked and reported on at least a monthly basis. A thorough analysis needs to be done to determine how transaction volumes are affecting KPIs, and a forecast analysis should be done to determine how upcoming volumes will impact performance. This can be done with simple regression analysis, where you look to determine how your transaction volumes correlate to your KPI numbers. You can find many articles on the Internet about how to perform regression analysis with tools like Excel. Some capacity planning tools do the forecasting part as a built-in feature.

Regression analysis will provide two values (amongst others) which are critical for Prod Support teams:
  1. R-squared: This tells you how much of the variation in your KPI is explained by transaction volume. The higher this value, the more confidence you can place in the relationship (and in any forecast built on it).
  2. Regression coefficient (the slope): This is the key variable. It tells you things like: if your transaction volumes go up by a given amount, how much a particular KPI will change.
The KPIs also need to be compared to predetermined Service Level Agreements (SLAs). To continue with the pricing example above, knowing how quickly we should be generating a price is critical. Let's say the target is 5 milliseconds and it's currently taking us 1 millisecond to generate a price. Our forecast shows that increasing volumes by 50% will degrade pricing performance to 4 milliseconds. We know, then, that it would be OK to increase volume by 50%, as we'd still stay within the SLA.
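
As a rough sketch of what that regression looks like in practice (the monthly volumes and CPU numbers below are made up for illustration):

    # Sketch of correlating monthly transaction volumes with a KPI (e.g. peak CPU %)
    # using simple linear regression. The numbers are invented for illustration.
    from scipy.stats import linregress

    volumes = [120_000, 135_000, 150_000, 170_000, 185_000, 200_000]   # transactions/month
    cpu_pct = [41.0, 45.5, 49.0, 55.0, 58.5, 63.0]                     # peak CPU %

    fit = linregress(volumes, cpu_pct)
    print(f"R-squared: {fit.rvalue ** 2:.3f}")          # how well volume explains the KPI
    print(f"Slope: {fit.slope:.6f} CPU% per transaction")

    # Forecast: what does a 50% volume increase do to the KPI?
    future_volume = volumes[-1] * 1.5
    forecast = fit.intercept + fit.slope * future_volume
    print(f"Forecast peak CPU at {future_volume:,.0f} transactions: {forecast:.1f}%")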

My capacity planning tool of choice is TeamQuest, which is actually quite nice. I'm keen to hear from other folks which other tools they've used and what their experience with them is.

Wednesday, August 14, 2013

Who Moved My (Application) Cheese: Change Management

Change…It happens. Not only does it happen, it needs to happen. The world moves quickly and applications need to keep up. Markets demand new features in software just for businesses to stay competitive. Change also enables businesses to find “new cheese” and take advantage of opportunities. Application support teams are uniquely positioned to help enable the business by effectively managing change to Production environments.


Change Management is a process, but it is also an art. Taking a stand when changes will cause undue risk needs to be balanced against the need for new features and bug fixes. Having a solid, healthy partnership between Support and Development always helps keep that balance.

Most companies have ways to track change, be it via change request tickets or simply via a spreadsheet. Obviously, the former will be better than the latter but the availability of change management tools also depends on the size of the organization.

It’s critical that change always be tracked, no matter how small. Attentiveness to this detail is critical; otherwise, change management will not be effective. Most change requests will contain at least the following key fields (other than the obvious, such as the Change Description):
  1. A Risk assessment
  2. Timing of the Change
  3. Implementation Plan
  4. Backout Plan
  5. Validation Plan
  6. Responsible Party
It’s healthy for teams to know all this information, as it can be critical information to know when faced with a change-related incident.

Besides the obvious (sometimes tedious) task of tracking changes, other practices also provide benefits in helping manage change.

For example, I recommend that Prod Support teams designate a change manager who will run a weekly meeting to review changes for the upcoming week. The agenda should cover the list of change tickets for the week. Each item should be represented by a Development resource who can speak to the need, risk and urgency of the change. This allows the group to ask questions about changes as well as propose alternative timings and approaches.

Also, many large organizations have change meetings at a larger scale (in many organizations, they’re known as Change Advisory Boards), for higher risk changes. The change manager should represent the entire Prod Support team in the CAB meetings and be prepared to discuss the upcoming changes.

Many times changes are classified into various categories with differing lead times. For example, a “Normal” change might have a 5 day lead time, whereas an “Emergency” change might have shorter lead times. In general, the shorter the lead time, the higher the risk and thus the necessity for more levels of approval. Many organizations allow certain changes to go in without supervisor pre-approval, especially during Incident/Break-Fix scenarios.

There is one change type that many organizations don’t have, but that I highly recommend. I call them Business as Usual (BAU) changes. BAU changes allow teams to make small, low-risk changes (e.g. change an IP address in a configuration file) without the need for much risk assessment or many levels of approval. Having this lite change type keeps the change process nimble. The change should still have a supervisor’s approval, but in general this should be easy to obtain. BAU changes should never involve releasing code (even small SQL changes to stored procedures), if only to prevent misuse.
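
These rules are simple enough to express as code. The sketch below is illustrative only; your categories, lead times and approval rules will differ.

    # Sketch of change-type rules as code: lead times and the BAU "no code" restriction.
    # The categories and rules are illustrative, not a standard.
    from datetime import date, timedelta

    LEAD_TIMES = {
        "Normal": timedelta(days=5),      # 5-day lead time, standard approvals
        "Emergency": timedelta(days=0),   # shorter lead time, more approval levels
        "BAU": timedelta(days=1),         # small, low-risk, supervisor approval only
    }

    def validate_change(change_type: str, submitted: date, planned: date,
                        involves_code_release: bool) -> None:
        if change_type == "BAU" and involves_code_release:
            raise ValueError("BAU changes must never release code (including stored procedures)")
        if planned - submitted < LEAD_TIMES[change_type]:
            raise ValueError(f"{change_type} change does not meet its lead time")

    # Example: a config-only BAU change planned for tomorrow passes the checks
    validate_change("BAU", date.today(), date.today() + timedelta(days=1), involves_code_release=False)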

Another best practice I recommend is for Prod Support teams to always have some level of approval over the changes to the environments they manage. This makes Prod Support a gatekeeper and ensures due diligence has been done before the change goes in.

Finally, Prod Support should also participate in the validation of change correctness.

Monday, August 12, 2013

The Jenga Stack: Application Monitoring

One of the most critical responsibilities that an application support team has is that of Application Health Management (see my previous entry The 6 Managements of Prod Support), also known as Monitoring and Alerting. From experience, I believe monitoring application health has 3 layers (plus a few additional areas that need to be monitored, lest availability be impacted).


These three layers work much like a Jenga stack. As individual components or blocks start coming out of service, system stability starts degrading until eventually the stack comes crashing down. Application Support should be able to know that individual components have been impacted and should be able to take proactive steps to put them back to 100% service.

The three layers I’m talking about are:
  1. Machine Health: Total CPU, Total Memory, Total Disk, Total Swap, Network interfaces, etc.
  2. Basic Application monitoring: Running processes being up, process memory utilization, basic error and exception checking, smoke tests, etc.
  3. Business Process monitoring: Transactions occurring correctly, transactions occurring within performance SLA, transaction acknowledgements, transaction persistence in the database, etc. For financial applications: market data feeds, user sessions, pricing, etc.

The other pieces that I didn’t include in these 3 are middleware components. The reason I don’t bundle them with the application is twofold. For one, middleware components such as application servers, databases and queuing systems are not really part of the application itself. Secondly, in most medium and large organizations, the monitoring of system health for shared infrastructure is managed by a separate team.

How to Check your Jenga Stack


Let me make one more comment, and that is about how monitoring should be done.

In terms of monitoring, it’s best to use enterprise tools such as CA APM or Geneos (ITRS). I’m familiar with these tools, and I can recommend them to teams looking for monitoring tools. Although I don’t have first-hand experience with Nagios, I hear a lot of good feedback on the tool (and it's mostly free).

I see many teams use tools like the ones mentioned above to capture issues, only to then turn around and send e-mail alerts. I strongly advise against this approach. E-mail quickly becomes unmanageable, especially if alerts and business requests are flying into the same mailbox. I’ve seen way too many missed alerts with this approach, and really, it’s unfair to ask that operators not miss anything.

I once saw a group that had about 2500 e-mails daily coming into their team mailbox. The mailbox contained user requests, team responses, valid alerts and false-positive alerts. No wonder they couldn’t keep up with it! A better approach to monitoring is to utilize graphical dashboard views. Dashboards consolidate an entire system into a bird’s-eye view and put the entire health of the system at the operator’s fingertips. Colors should remain simple in dashboards: red for critical alerts, amber for warnings and green for OK. In cases where systems are very time-sensitive, just using two colors, red and green, is also OK (as sometimes, warning thresholds start getting ignored). ITRS and Introscope provide dashboarding capability (and so does Nagios).
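
At its core, a dashboard is just a mapping from raw metrics to those three colors. Here's a minimal sketch, with made-up metric names and thresholds:

    # Sketch of turning raw metrics into the red/amber/green statuses a dashboard shows.
    # The metric names and thresholds are illustrative.
    THRESHOLDS = {
        "cpu_pct":        {"amber": 75, "red": 90},
        "disk_used_pct":  {"amber": 80, "red": 95},
        "queue_depth":    {"amber": 1_000, "red": 10_000},
    }

    def rag_status(metric: str, value: float) -> str:
        limits = THRESHOLDS[metric]
        if value >= limits["red"]:
            return "RED"
        if value >= limits["amber"]:
            return "AMBER"
        return "GREEN"

    print(rag_status("cpu_pct", 82))   # -> AMBER: warn before the stack falls over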

Dashboards should start with the end in mind. For example, I always like for teams to start with system architecture diagrams. Once they’ve drawn the diagram in the tool, they work backwards to determine which alert will give them the necessary information about the various components they’re interested in monitoring.

Another suggestion I’ll make is that Prod Support teams should continuously and organically work on cleaning up false positives. False positives make teams ineffective, as the team can’t tell the good alerts from the bad, and eventually everything starts getting ignored.

Thursday, August 8, 2013

The Uptime Carve Knife: Calculating Availability

As mentioned in my previous post The Purpose of Production Support, availability is what Production Support is all about. Availability is always expressed as a percentage of uptime. Uptime is the pre-determined amount of time that an application is expected to be available to conduct business. Generally, uptime is expressed in minutes. For example, if an application runs 24x7, the total uptime would be 24 hours * 7 days = 168 hours, which expressed in minutes is 168 * 60 = 10080 minutes. Achieving 100% availability means the application was available for the entire uptime period.


From stage left now comes Mr. Outage to rob your app of precious uptime minutes. He’ll pull out his sharp knife and carve out a slice of uptime. Keep in mind that any issue that has business-level impact qualifies for lost uptime minutes. Let’s say the outage lasts 1 hour, or 60 minutes (which, of course, would never happen to your team, since you’re applying the information in this Blog). This means that your application was available for only 10080 - 60 = 10020 minutes. This is equal to 10020/10080 = 99.40%, which is your availability.

The example above assumes that there was a full outage, where no transactions could be made in the application. Availability, however, need not be an all-or-nothing proposition. Suppose your application services several geographic areas, with varying numbers of business users. Let’s say the breakdown is as follows: 50% AMRS, 40% EMEA and 10% APAC. What happens to our availability number if the outage carving knife is pulled out only during a time which impacts the APAC region? Now our availability impact is really only 10% of the total, and the outage minutes can be adjusted. The adjusted minutes, if we follow the scenario above, would now be 60 * 0.10 = 6. The availability number now looks much better: 10080 – 6 = 10074, which totals 10074/10080 = 99.94%. Voila! We’ve achieved 3-9’s!

A Prod Support team needs to determine what services an application provides. The team also needs to determine what percentage of the overall impact an outage of a particular service represents. The percentage of the user base affected also needs to be taken into account. Let me continue with the example above.

Suppose, now, that the outage was caused by a data feed not arriving on time. Let's say that because of this missing feed, the users can perform 80% of their activities, but not the remaining 20%. Again, the outage is happening in APAC, so our adjusted impact in minutes is 6. Of those 6 minutes, the users could still do about 80% of their work, so we adjust that value and only account for the 20% worth of impact (6 * 0.20 = 1.2 minutes). Now our availability number looks like: 10080 – 1.2 = 10078.8, which equals 99.99%. We’re a 4-9’s shop!
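
The whole calculation boils down to a few lines. Here's a minimal sketch that reproduces the numbers above; the region share and functionality weights are simply the parameters from this example.

    # Sketch of the availability arithmetic above: total uptime minutes minus
    # impact minutes, where impact is weighted by region share and lost functionality.
    def availability(uptime_minutes: float, outage_minutes: float,
                     region_share: float = 1.0, functionality_lost: float = 1.0) -> float:
        impact = outage_minutes * region_share * functionality_lost
        return (uptime_minutes - impact) / uptime_minutes * 100

    weekly_uptime = 24 * 7 * 60                                           # 10,080 minutes for a 24x7 app
    print(round(availability(weekly_uptime, 60), 2))                      # full outage -> 99.4
    print(round(availability(weekly_uptime, 60, region_share=0.10), 2))   # APAC only   -> 99.94
    print(round(availability(weekly_uptime, 60, 0.10, 0.20), 2))          # 20% of functionality -> 99.99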

What if you have multiple applications? How do we account for them as a whole? Most folks assume that a simple average is the appropriate way to account for total availability. But if you remember from your 5th grade arithmetic class, percentages have to be weighted before they're combined. So, how do we do it?

Suppose you have 6 applications, each with the same uptime window, with availabilities as follows: 57%, 63%, 56%, 49%, 65% and 78%.

The total availability “points” your applications could achieve is 100 each, or 600 (which would mean 100% availability). If we add up the points actually achieved, you’ll note that they come to 368, which out of 600 = 61.33%. This is your total availability.

Because the uptime windows are equal, this happens to match the simple average of the six numbers. The moment the uptime windows differ (say one application runs 24x7 and another only during business hours), the simple average stops telling you how the systems provide availability as a whole: each application’s availability needs to be weighted by its uptime minutes.
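
Here's a minimal sketch of that weighting; the uptime windows in the mixed case are made up for illustration.

    # Sketch of aggregating availability across applications by weighting each one
    # by its scheduled uptime minutes (the numbers are illustrative).
    def overall_availability(apps):
        """apps is a list of (scheduled_uptime_minutes, availability_pct) pairs."""
        total_scheduled = sum(minutes for minutes, _ in apps)
        total_available = sum(minutes * pct / 100 for minutes, pct in apps)
        return total_available / total_scheduled * 100

    # Equal uptime windows: the weighted result equals the simple average (61.33%).
    equal = [(10_080, p) for p in (57, 63, 56, 49, 65, 78)]
    print(round(overall_availability(equal), 2))

    # Different uptime windows: a 24x7 app at 99.0% and a business-hours app at 90.0%.
    mixed = [(10_080, 99.0), (5 * 8 * 60, 90.0)]
    print(round(overall_availability(mixed), 2))   # pulled towards 99 because of the weighting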

Wednesday, August 7, 2013

Mayhem is Everywhere: Incident Management

Do you hear that sound? That's the sound of your application coming to a screeching halt because the "new guy" (should we call him Mayhem?) just decided to install a new test server. It just so happens that the server is configured with a duplicate IP address, bringing an entire subnet (and about 40 servers) down. *Any similarities to real events are not intended.


So what now? No insurance company (not even Allstate) can protect you from this type of mayhem, so you're on your own, Prod Support guy! But you're prepared, of course, because you worked very diligently to define an Incident Management Process.

What does a good Incident Management Process look like? Good Incident Management processes handle two key pieces: Service Restoral and Communications. The former is likely more intuitive than the latter, but both are critical. Communications give your stakeholders confidence that service is being restored and some insight into why it might be difficult to recover from certain situations.

When an incident is first detected, the very first thing that Production Support groups must do is acknowledge that there is an incident and begin communicating with stakeholders: business partners, technology partners and external clients, if necessary. An initial acknowledgement (email) should be relatively easy to generate as it doesn't really require much detail. It's basically just a note that provides confidence to stakeholders that someone is looking into the issue with the appropriate level of urgency. Another important part of this initial step is to open a ticket (or add some kind of entry) in a tracking system, to ensure that we'll be able to keep metrics about the incident. A good best-practice is to put the ticket ID in the initial acknowledgement e-mail and in follow-up communications.

Once the acknowledgement is out the door (or in parallel) the incident triage and troubleshooting begins. In most shops where I've been, this is done via an incident conference call (or bridge). In many cases, the bridge convenes even before the acknowledgement is out the door. At this stage the severity of the incident is assessed and the right resources are called to help resolve the situation. One useful thing to do is to add all participants to a group chat so that it's easier to share logs, server names and other information that might be difficult to communicate via phone.

Regular updates on the results of the investigation and the resolution steps should be sent out to stakeholders. I recommend that those go out every 15 minutes - however, I've seen some shops that do this on an as-needed or even an hourly basis.

Email updates should follow a standard template. When designing the template, consideration should be given to the fact that many users are looking at the updates via hand held devices (which might not be able to render complicated tables or graphic-intensive html).
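
As an illustration, a plain-text template along these lines tends to render well on handhelds; the ticket ID, fields and 15-minute default below are just examples.

    # A hypothetical, handheld-friendly update template: short lines, no tables,
    # ticket ID up front, and a clear "next update" time.
    from datetime import datetime, timedelta

    def incident_update(ticket_id: str, severity: str, summary: str,
                        impact: str, actions: str, minutes_to_next: int = 15) -> str:
        next_update = (datetime.now() + timedelta(minutes=minutes_to_next)).strftime("%H:%M")
        return (
            f"[{ticket_id}] Severity {severity} - Incident Update\n"
            f"Summary: {summary}\n"
            f"Business impact: {impact}\n"
            f"Actions in progress: {actions}\n"
            f"Next update by: {next_update}\n"
        )

    print(incident_update("INC0012345", "1", "Order entry unavailable",
                          "Users cannot submit orders", "DBA team restarting database listener"))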

Once the issue is resolved, a clear "Service Restored" notice should go out to the stakeholders to let them know that the system is available to them once again.

We're not done yet! Once the service is restored, an assessment of availability should be performed to determine how much the incident has made us deviate from the 100% availability mark. Also, a root-cause investigation should be performed and the whole incident should be summarized in a formal Post-Mortem or Executive Summary. I'll cover more on calculating availability impact and on executive summaries in another post.

At this point, the Problem Management process begins.

Tuesday, August 6, 2013

The 6 Managements of Prod Support

There are many processes that support teams must follow. But if you can do these 6 things correctly, you'll have a solid base to build on as a Prod Support group:

Application Health Management: This is also known as Monitoring and Alerting. At any given point, the support staff needs to know how the application is doing and take proactive measures to ensure continued availability. At the low end of the spectrum, this can be done by sending e-mail alerts when something isn't right in the environment. A more ideal solution is to develop graphical dashboards that provide a Red/Amber/Green status for the various components of the application.

Capacity Management: This is also known as Capacity Planning. Capacity Planning goes hand-in-hand with Application Health Management. By being able to correlate Key Performance Indicators in the hardware and software with transactional volume, a Prod Support team can know how stable an application will be during periods of high volume.

Change Management: Many, if not most, of the problems arising in Production, from my experience, can be traced to changes in the environment. Whether it's a botched release or someone fat-fingering an IP address in a configuration file, changes expose applications to instability. Managing who can make changes and how those changes are done can significantly help application availability.

Incident Management: Things will go wrong. Otherwise, there would be no Production Support teams to worry about availability. When things do go wrong, a speedy recovery focused on service restoral is critical to mitigate impact to business users.

Problem Management: This is how stability improves. Incident Management should logically segue into Problem Management. If Incident Management answers the question of how service will be restored, Problem Management answers the question of how the outage will be prevented from recurring.

Customer Relationship Management: Production Support teams should be intimately aware of who their stakeholders are and how to communicate with them. A solid documentation set describing how system outages affect stakeholders, and how to contact them, enables Production Support teams to prioritize recovery steps. It also helps the team contact business users to let them know an issue has occurred, so that they can execute workaround steps if needed.

Underlying all these processes is Staff Training as well as Metrics and Reporting.

Why training? You can have all the process in the world, but unless the support staff is knowledgeable about their application and how to apply those processes, they won't be effective.

Metrics and Reporting, on the other hand, is how you track that each of these processes is being followed. With metrics you identify improvement areas for the entire team and the suite of applications they support.

Make Engagement Easy

So you've got a Production Support team, now what? The team has to be accessible to clients. Who are your clients? Both internal and external partners. It's important to be thoughtful about the avenues provided to engage a Production Support group. Let me provide some examples of how different teams choose to be engaged and what the pros and cons are for each method.


Telephone hotlines are a typical way to be contacted. Having a hotline means your customers' questions are answered on the spot, and phone interactions provide an opportunity to ask about the urgency of the request as well as to get clarification around the problem. Hosting a hotline also means you have to staff it adequately so that users don't wait too long for an answer during the period the hotline is "open." The analysts who field questions on the hotline also have to be quite knowledgeable regarding all the features of the application - at least at a basic level. There's nothing more frustrating for a business user than the feeling of getting the runaround when faced with a critical deadline or outage.

Email provides an easy method for contact. With the widespread use of mobile technology, users can send email readily. Email, however, is free-form and initially might not contain any information regarding urgency (or clarity around what might be the issue). Email does provide the ability to attach screenshots, which might help determine what went wrong. In cases where clarification is needed, it's often best not to follow up with another email; a phone call might be more appropriate.

Ticketing systems provide the ability to engage a Prod Support group while doing a bit of screening regarding the urgency and nature of a problem. They are less free form than email, but might not be as readily available to users (e.g. when the ticketing system sits on an Intranet).

More and more teams offer the ability to be contacted through group chat as well as instant messaging software. These suffer from the weaknesses of email (free form), of ticketing (not widely available outside the organization), and of hotlines (someone needs to be ready to respond). They do provide ease of engagement when available and also allow for more interactive vetting of an issue than email.

Typically the best approach is to provide various engagement methods and let the users determine which ones they'd rather use. I generally don't like tickets, as most users find it a bit easier to send an e-mail than to play 20 questions when engaging Prod Support. Emails tend to provide a happy medium between ease of engagement and capturing data about the problem. Of course, other methods than the ones provided here might help as well.

The Purpose of Production Support

What's the purpose of Production Support? The answer to that question will serve as our guideline for the rest of the topics in this Blog. With this entry, I'll look to provide an answer to that question by condensing various points into a single statement (that answers the question).


Most Production Support analysts and managers easily recognize that the main reason behind the systems they support is to enable business. Yet, all too easily, a system-centric focus is still maintained and, worse, we manage to that. At the end of the day though, you could have the most stable system in the world, but if it doesn't enable business, it's really worthless.

The second point follows from the first. In order to enable business, the systems must be available to the business user. Availability also implies that everything the user needs from the system is ready. For example, if a business user needs certain data before she can execute transactions on the system and that data is not there, the system is not truly available. So, despite the system being up, despite all its interfaces being connected, if the user cannot perform a transaction because something is missing, then the system isn't available (at least not 100%). The user could potentially enter a transaction, but she would be doing so blindly - which implies risk.

Another way to view Production Support is as risk management, which follows from the second point. Production Support is a form of insurance that the business pays for to keep operating. We know perfect systems are rather impossible to achieve (at least at the speed with which our businesses move). So Production Support is there to proactively mitigate the risk of something going wrong. When things do go wrong, based on what we've already said, the job of Production Support is to recover the systems from outages to enable business.

Finally, we have to manage to a target. The level of contractual performance must be pre-arranged. Thus, we arrive at an availability target, which is expressed as a percentage of system uptime. Note that uptime and availability are not the same. As explained before, a system might be up, but not available (the data is missing). Many organizations establish a target of 3-9's (99.9%) or better (5-9's, 99.999%). Of course, any target that makes sense and will meet the goals of the business is acceptable, as long as it's pre-arranged contractually - with the business.

We can simply say then, that:
The Purpose of Production Support is to maintain System Availability, at a target level, which will enable the business user to achieve their goals.

You'll notice that other sites give us some insight into the importance of Production Support as well as some of its processes (e.g. Wikipedia: http://en.wikipedia.org/wiki/Production_support). But you'll note that they don't attempt to answer the question of purpose. We'll cover many of these activities (and more) in a bit more depth, but it's important to keep in mind why a Production Support team is even needed.