Available for Consulting

Need a job? I might be able to help you find one. Need help? I'm available for consulting engagements. Send me an e-mail, or contact me via Google+ or LinkedIn.

Monday, November 11, 2013

A Recap of all Critical Support Processes (Part 1)

This post centers on the Support processes that are critical. There may be more, but I've compiled a list of those I consider key. This first part focuses on the top 10; doing these will make your team good. The second part will cover the rest, and if you do those in addition, your team will be great!

So without further delay:

Process Name | Description | Why you need it
Major Incident Management | Managing issues with the appropriate communications level. | This is how you fight fires.
Monitoring & Alerting | Identifying the health of your applications from a hardware, middleware and business level. | This is how you identify issues with your applications.
Capacity Planning | Identifies whether applications can handle increasing volumes. | You're able to predict how volumes will affect your applications.
Problem Management | After root cause analysis, this process tracks follow up actions. | This is how you get proactive and prevent issues from happening again.
Change Management | Tracks all changes going into the environment. | Changes bring instability.
CRM | This is how you track and manage your customer relationships. | You know who to contact/tell about issues.
Stability Projects | These are short-term projects that help resolve recurring issues. | These supplement your Problem Management program.
Handovers | Teams finishing a shift transfer accountability to those on the next shift. | Ensures that work is done following-the-sun.
Production Transitions | Process to keep the Support team up to speed upon new releases. | Ensures your team's application knowledge doesn't become stale.
Disaster Recovery Planning | Process that outlines how to recover applications during a disaster. | Ensures you can continue to do business during a major disaster event.

Wednesday, October 30, 2013

How To Transition from Support to Development

One of the trends I see in forum discussions is Support staff asking questions about becoming a Developer. It seems that many Support staff, at some point, look to make the transition to Development.

I've talked about this before in other posts. And although my advice is to look for Development jobs if you want to be a Developer, there is a way for Support folks to eventually make a transition to Dev.

As with any job, this transition will require that you build competency in your technical abilities and that you build a network of contacts that will help you make a successful jump.

The one thing I can say about building a good network with Developers, for Support staff, is to build solid relationships with your own Dev team. I'm sure this is not a ground-breaking revelation. A good way to do this is to learn the application well, which involves formal and informal knowledge transfer from guess who? Be responsive to issues and lend a helping hand when the Dev team needs help getting their work accomplished. If invited, make an effort to attend social functions that involve Dev staff. Knowing folks outside of work settings is always beneficial when building networks. Remember that these same Development folks might be able to point you to internal or external openings.

In terms of building your technical abilities as a Developer, there are several ways to go about it. Since your day-to-day work doesn't require a heavy Dev focus, you'll need to find a way to keep your skills sharp (and to build new skills) outside of work. Here are a few suggestions.

Take on freelance work. This is better than having a pet project at home, in that you'll have requirements and deadlines to meet. You won't just toss these projects to the side when your buddies invite you out for a beer. Take on projects that will challenge your comfort zone but still give you a fighting chance to deliver. Solid results and good feedback are always welcome on your resume and on networking sites.

Another possibility is to take a college course or two. Many companies these days will actually pay your tuition, so take advantage of those programs, which can provide a solid foundation when building new skills. Again, the good thing about taking college classes is that you have deadlines to meet. Given the price of the courses, you also won't just toss them aside.

Support efforts also benefit from efficiency. Building tools to make the support effort more efficient is always a good challenge to take on. And you can do this while at work, if your extracurricular life doesn't allow for taking on side projects.

And yes, it's OK to take on pet projects of your own. Try to incorporate new things in them. For example, if you don't know how to use patterns, try to learn about them and implement your code using those designs. Incorporate best practices like code refactoring. These types of exercises will make your brain more efficient at applying these techniques when the time comes. Keeping your skills sharp will help if you want to transition to Development and it will also make you a better Support resource until then.

Friday, October 25, 2013

Keep It Together: Stability Reviews

Periodic reviews of what's been happening with your system(s) are a critical part of ensuring maximum availability. Every system will eventually become obsolete: volumes will increase, better hardware and technology will become available, software stops being supported, etc. Thus, you want to be ahead of the curve when it comes to reviewing your system and determining whether it's time to tune or change something about it.

A thorough stability review for your system should happen at least twice a year, though I suggest you do this on a quarterly basis. Most companies don't perform stability reviews until something is wrong with the system (when they've had multiple recurring issues). Needless to say, that's not the best approach, though you should never let a good crisis go to waste.

So what should the stability review entail? Every architectural component should be reviewed for improvements. For example: the application, the middleware, the databases, the network, the application hardware, upstream and downstream dependencies, etc.

The way I like to prepare for a stability review is to look at an architecture diagram. I list every single component that shows up on the diagram. Then I organize recurring meetings with the owners of each component to discuss what can be done to improve resiliency. This will become your working group. I generally find that getting buy-in to perform this type of assessment is easy. It makes sense, since most people would rather proactively resolve issues, than work on them in a middle-of-the-night firecall.

When you meet, ask each stakeholder what can be changed for improvement. For example, ask the network team whether all interfaces are redundant and whether they will seamlessly failover when something goes wrong. Talk to your DBAs to ensure that your DB is optimally tuned for the amount of data you have (Do you need to purge? Do your execution plans need to change? Are the right DB parameters set to ensure maximum throughput?). Discuss with the Development team whether things can be improved (Is seamless, automatic failover between redundant components a possibility? Can a graceful way of shutting down the application, to ensure maximum transaction safety, be coded? Can dependencies on lengthy batch feeds be removed or reduced?). Review capacity planning reports to ensure each section of the system will be able to handle the application volume.

As you review each component, action items will come up. It's important to set expectations regarding the turnaround for actions when you kick off your stability assessment. Try to get everything in place in 1-2 months. Don't let activities drag on; otherwise, the risk of something going wrong becomes higher, and there's also a risk that things will never get completed.

Follow up on the action items on a weekly basis with your working group. As the actions start being implemented, your application resiliency will get better. And your confidence around your application stability will significantly grow.

Tuesday, October 22, 2013

Saying No (Even To The Business)

Many of the posts on this blog have focused on enabling the business and keeping the business in mind as we carry out Support tasks. So, you might find it strange that I'm dedicating a post to saying "no" to the same business users we purportedly help. Let me provide an example where I've directed my teams to say no to the business. Note how we did it, and decide whether you agree or disagree.

One of the managers who reports to me brought up a concern. His team, for historical reasons, had been helping the business with even the most menial of requests. He was trying to determine how to stop these types of requests, as they were robbing his team of valuable bandwidth that could be better utilized for more value-added tasks. For example, they would get calls from business users asking why their printer wasn't working or asking them to reset their LAN id. There is a helpdesk that manages these requests, yet the users were reaching out to Support to work on these tasks.

I asked him to put a meeting together with the business department head so we could talk about these requests. Ahead of the meeting, we prepared a couple of slides identifying Support Effort and where it was going. We determined that about 10% of our total Support bandwidth was dedicated to servicing menial requests that the business users should have been able to handle. We also identified various areas for improvement that we would otherwise be able to accomplish, were we not spending time, say, fixing printers.

The meeting came and we presented our case to the business partner. He was in total agreement that we should be spending our time doing things like automating manual tasks or adding better monitoring, rather than resetting passwords or clearing out blockages in a printer. We asked if he could send us an e-mail with his expectations around which items we would no longer be servicing.

When the next request came around, we responded to the business user that they should contact the help desk and we provided instructions on how to do so. We also attached the e-mail from the business head explaining that we would no longer be working on such requests. Some of the users were less than thrilled, of course. However, they eventually understood the reasons behind our inability to service those requests. Eventually, the requests stopped.

In servicing the business, we have to realize that our bandwidth comes at a premium. And it's actually in the business' best interest to have us focus on value-added tasks. It's important to maintain our posture as a group and not become a dumping ground for issues that people would just rather not deal with. In this scenario, we made our case and showed the business that they were better off with us not spending time on these requests. As paradoxical as this sounds, it actually helped us support them better.

Wednesday, October 16, 2013

To Patch Or Not To Patch

You've been through this before: those weeks where, day after day, the same issue strikes, and despite your best efforts at determining a root cause, you can't find one. You've increased logging and are going through the files with a fine-tooth comb; all to no avail.

Then, in a glorious burst of inspiration...Finally...You find it. It's staring right at you. It's a bug! You can fix that. In fact you have the fix, but guess what? It's Monday. Not only that, in order to deploy the fix, you'll have to bring down (impact) your 24x5.5 system for a while. When the bug presents itself, it's a lot of work (usually at around 2 AM) to correct the situation it creates. It involves updating data manually, which could be potentially dangerous. So, should you impact your system availability and patch, meaning you get to sleep and avert the risk of errors? Or do you wait until the weekend and attempt to hold the fort?

The answer to this is all about risk management, which is one of the primary goals of a support team (read The Purpose of Production Support). Patching (change) involves inherent risk. Making a change to your environment could have impacts beyond what you're trying to correct. For example, what if the bug you found requires correcting a common library (meaning you have to recompile a good number of binaries)?

One of the questions you should be asking yourself, too, is, how thorough was the testing? Many times, it's impossible to perform a full set of regression tests before the change has to go in.

In the scenario above, we're also subject to risks extraneous to technology. For example, what if your system is a financial trading system and an outage means your business users are unable to take advantage of a favorable move in the market?

The scenario above has a fairly well-known workaround. Another question that comes up, then, is whether the risks inherent to this workaround can be mitigated. For example, is it possible to automate a set of SQL queries that will reduce the potential for manual errors?
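As a concrete illustration, here is a minimal sketch of what automating that kind of manual data fix might look like. The table, column names and thresholds are hypothetical; the point is simply that a single, reviewed script with sanity checks and one transaction is far less error-prone than hand-typed 2 AM SQL.

```python
# Hypothetical sketch: wrap the manual workaround in one reviewed, transactional script.
import sqlite3  # stand-in for whichever database driver your system actually uses

def apply_workaround(conn, as_of_date):
    cur = conn.cursor()
    # Find the rows the known bug leaves in a bad state (table/columns are made up).
    cur.execute(
        "SELECT COUNT(*) FROM trades WHERE status = 'STUCK' AND trade_date = ?",
        (as_of_date,),
    )
    stuck = cur.fetchone()[0]
    if stuck == 0:
        print("Nothing to fix; exiting without changes.")
        return
    # Sanity check: refuse to run if the count looks implausible for this bug.
    if stuck > 500:
        raise RuntimeError(f"{stuck} rows matched; expected a handful. Aborting.")
    cur.execute(
        "UPDATE trades SET status = 'PENDING' "
        "WHERE status = 'STUCK' AND trade_date = ?",
        (as_of_date,),
    )
    conn.commit()  # one transaction, so a half-applied fix can't be left behind
    print(f"Corrected {stuck} rows.")
```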

The ability to identify when the bug strikes is also an important risk factor. If identifying the error is straightforward and we have an automated workaround, then the risk becomes much lower.

So the answer to "To Patch or Not to Patch" requires some inquiry into many factors. Each situation will be different, with its own urgency as well as its own business and technical nuances. But asking these questions and struggling through them to make the right decision is a sign of a mature Support team.

Think about these various factors the next time you're faced with a dilemma like this. You might very well conclude that the best approach is to push the patch out a few more days. On the other hand, you might decide that the workaround is too risky to continue with, meaning you have no choice but to install the patch.

Friday, September 27, 2013

How to Hire Support Staff

Hiring the right personnel is one of the most critical things you can do as a Support manager. In an era of small budgets, you need all the people you can get and you need them firing on all cylinders. Otherwise, you'll spend the next six months trying to teach your new hire about your application, your processes, your business, etc. just to find out they "don't get it."

So how do you ensure you hire people that "get it?" There are several things you can do to ensure hiring success:
  1. A good phone screen. I like to split phone screens into several parts. First, I explain the role and what I'm looking for in a candidate. Second, I have the candidate take me through their resume. I leave this open-ended so I can also assess their communication skills, and I look for them to explain how their past experience makes them a good fit. Third, I ask them a series of short-answer technical questions, e.g. "What command would you use to find a string in a log file?" Any question regarding keywords they put in their resume is fair game. For example, if you say you know C, you'd better know what the static keyword in C means (hint: it's not the same as in C++). Finally, I give them the opportunity to ask me questions.
  2. Fair questions in face-to-face interviews. I generally ask three technical questions when I interview face-to-face. These are all written/on-the-board exercises. I ask a SQL question, usually how to perform an implicit join. I ask a scripting question related to the relevant OS for the role. I also ask them to write a short program in any compiled or scripting language they're comfortable with. For example, "How would you implement a function to write a string backwards?" (a sample answer is sketched after this list). The idea is not to "stump" people or to show off your own knowledge, but to assess what they know.
  3. Answer your own question: "Am I able to work with this person?" This might be the most important piece of the puzzle. Even if a person has all the skills needed for the role, they'll also need to have the right personality to get hired. You know your group's personality and culture. Never hire someone who won't fit in. You'll only end up losing them as soon as you're done training them.
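For reference, here is one acceptable answer to the "write a string backwards" question, sketched in Python. Any language the candidate is comfortable with works just as well; what I look for is whether they can reason through the loop.

```python
def reverse_string(s: str) -> str:
    # Swap characters from both ends toward the middle.
    chars = list(s)
    left, right = 0, len(chars) - 1
    while left < right:
        chars[left], chars[right] = chars[right], chars[left]
        left += 1
        right -= 1
    return "".join(chars)

assert reverse_string("support") == "troppus"
```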
The above pieces of information about a candidate should give you enough insight into their skills and their personality. You can ask more, or you can ask less. But for me, what I outlined above has become a winning formula.

Wednesday, September 25, 2013

Whose issue is this?

One of the things I try to challenge my teams with is following through on issues and user inquiries. Many times issues come our way, only for us to find out that it's really within another team's scope to correct or address. This can happen for several reasons. For example, from experience, a business user has determined that your team provides the best turnaround time on issues. It could also be that the documentation on whom to contact is unclear and the business user goes knocking on the first door she finds.

In many cases, I see teams simply forward the e-mail or ticket along to another group. A lot of times, the e-mail or ticket won't have full documentation on timeline, impact, history, etc. The team receiving the e-mail or ticket might not react with the right level of urgency. In fact, I've seen issues go on for days like this; being passed from one team to another. In the meantime the business user just waits in frustration.

A better approach to prevent long-running e-mail threads that lead nowhere is for the receiving team to follow through on the issue as if it were their own. In my opinion, if a business user sends an issue your way, you should own it to completion. Instead of forwarding the e-mail or ticket, get the right team(s) on a call, and ask the right questions. Communicate the right level of urgency on that call as well. Just choosing the right forum (phone call versus e-mail) can significantly cut down on the turnaround time.

Once you have an answer, personally deliver it to the business user. Don't expect other teams to do it. They might not have the same finesse and level of service that you have. Remember, you always want to keep those business users delighted.

Many teams will complain that they don't have enough staffing to personally handle each of these types of issues or inquiries. I would challenge that assumption. Many times, we spend more time fighting fires and explaining bad results than it would have taken to just manage the issue to completion. Guess who the business user will complain about if the issue doesn't get addressed on time?

So skip that coffee or tea break if you have to. Challenge yourself to provide your users the best service possible. They'll thank you for it and your organizational growth will, indeed, reflect it (so will your bottom line).

Tuesday, September 24, 2013

Are You Sure About Your Monitoring?

Today we had an embarrassing issue happen. It started at 1:00 AM and we didn't catch the problem until 10:00 AM when business users reported they were missing some data. So, basically, we went about half a day without knowing something was wrong. As it turns out, we had a monitoring gap.

A log monitor which captures certain strings in the file did not capture one of the strings it was configured for. Here's the timeline of events of why it didn't capture the error:
  1. The monitor was set up to tail the log file every 5 minutes to capture everything in the log since the last time the monitor ran. This is by design with a vended application we use for monitoring.
  2. The monitor ran at 12:59 AM and didn't find any errors.
  3. The error came in at 1:00:59 AM with the string "LOG EXCEPTION".
  4. The log file rolled because it has a size limitation.
  5. The monitor ran at 1:04 AM and tailed the file again, but the error was now in the rolled file.
  6. At 10:00 AM, the business user reported the problem.

Gotcha! Clearly we missed this in our thinking when we set up the monitor. We've now configured our monitoring to always look at the last two log files. Since the files don't grow too quickly, that should suffice (given the 5 minute interval).
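To make the fix concrete, here is a minimal sketch of the "check the last two log files" idea, independent of any particular monitoring product. The file paths and error string are assumptions, and a real monitor would also remember how far into each file it had already read.

```python
import glob
import os

ERROR_STRING = "LOG EXCEPTION"

def recent_errors(log_glob="/var/log/myapp/app.log*", files_to_check=2):
    # Sort candidate files newest-first and keep the live file plus the last rolled one,
    # so an error written just before a roll is still seen on the next run.
    candidates = sorted(glob.glob(log_glob), key=os.path.getmtime, reverse=True)
    hits = []
    for path in candidates[:files_to_check]:
        with open(path, errors="replace") as fh:
            hits.extend(line.rstrip() for line in fh if ERROR_STRING in line)
    return hits

if __name__ == "__main__":
    for line in recent_errors():
        print("ALERT:", line)
```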

So if you use Sitescope, keep this in mind. Don't get caught with your pants down. I'm sure by now I've lost everyone who uses Sitescope for monitoring (they're now checking they don't have similar gaps). :-)

Cheers!

Friday, September 20, 2013

A Tough Interview Question and How to Ace It

I was asked the following question in a forum, and I thought it was worth sharing as a blog post:

Question (edited):

During a job interview I was given the following scenario to test my ability at handling difficult situations as a Production Support analyst.
You receive a call from two business users and:
  1. You are alone covering the shift.
  2. The issues are not documented in the run book.
  3. Both stakeholders are insisting their issue is critical.
  4. The severity level of both issues is the same.

Issue 1: The business user is saying they are unable to log into the application. This is happening to multiple business users and you know this is occurring during peak hours.

Issue 2: After the first call, you get a call from a different business user saying they are unable to generate reports to validate the data for another application.

Which issue should be given priority? How would you handle this situation?

Answer:

Several things are going on here:
  1. In the first reported issue, you didn't have a clear indication of impact. If people couldn't trade, that might be more important than producing a report (the 2nd issue), despite what you heard on the phone from either partner. However, it sounded like the report was needed for reconciliations, which some groups depend on for trading. In this case, you need clarification on the issues. Get the business users on the phone again and find out more. One of the things you have to get really good at in Support is making sure you really understand the problem. Sometimes, calling the users back and getting clarity is the only way to accomplish this.
  2. If you're alone on a shift and need help, call and get it. It's better to take a few minutes and escalate than to try to go it alone. Remaining calm and really thinking about the best approach is a sign of maturity in a Support associate. Wake someone up if you have to. I always tell my guys it's better to wake someone up than to let things fall apart and cause financial loss.
  3. Keep in mind that no two issues are ever really exactly the same in terms of urgency. One is usually more urgent than the other. But suppose they were the same and you can't get help. In that case, all you can do is work them on a first-come, first-served basis. You being the sole person on a shift, without enough bandwidth to handle multiple issues, is more a sign of bad coverage (and ineffective management) than anything. Of course, you wouldn't say that in an interview ;)

I hope you find this answer helpful and that it will help you ace your next Production Support job interview!

Wednesday, September 18, 2013

Examples on Increasing Productivity: Part 2 (For Managers)

A reader asked me a good question: "Would you provide concrete examples on how to increase the productivity of my Production Support team?" Although I'll answer with points that would be important for any team, not just Support, I'll provide examples that apply more directly to Production Support groups.


This is Part 2 of this article. Click on the link to go to Part 1.

The first thing you can do as a manager is to set high expectations for yourself and your team. This means two things: set expectations, and make sure they're challenging enough. One thing that has worked great for me is doing strategic planning at the beginning of every year with all my directs. We keep it simple. We identify things we'd like to improve about the applications we support (the Challenges). Then we come up with Action Items. Action Items define three things: where we are, where we want to be, and what we're going to do to get there. Incidentally, there's a technical name for just talking about the challenges and not coming up with what you're going to do to fix them... It's called complaining.

Make sure the action items are achievable (yes, you can use SMART goals), but make sure they'll also challenge your team to do their best. Having clear guidelines and defined projects has worked wonders for the amount of work that my groups achieve. People don't come into work wondering what they're going to do. If the BAU work (incidents, service requests, etc.) is low, then it's time to pull out the plan and work on those strategic objectives. I review plan progress on a weekly basis and provide quarterly updates to senior management to ensure my directs' work is being highlighted and they're getting the right level of visibility.

Keep a constant eye out for ways to maximize the productivity of your team. Just because you have a plan doesn't mean you can't add important items along the way. Also, if something that was previously identified as important no longer is, remove it from the plan. Work only on those things that will add value to your group.

As I've said in prior posts, take time to reward and reinforce productive behaviors. Production Support teams go through a lot of stress and team members need to know that their work is not going unnoticed.

Finally, use your metrics to determine areas for improvement and to track how Productive your teams are being. If all the work you've planned to do is not having a positive impact on your Availability metrics, Support effort, Time tracking, etc., then you're not focusing on the right things. Keep the Purpose of Production Support in mind in everything that you do.

Examples on Increasing Productivity: Part 1 (For Associates)

A reader asked me a good question: "Would you provide concrete examples on how to increase the productivity of my Production Support team?" Although I'll answer with points that would be important for any team, not just Support, I'll provide examples that apply more directly to Production Support groups.


So, let's start with what you can do as an associate to improve the productivity of your team. There's an implication here: productivity increases are not just the responsibility of managers (though we'll talk about things managers can do). First of all, be open and ask your manager the question, "What can we do to be more productive?" Many times, as Support groups, we get bogged down in the day-to-day, tactical activities and don't spend enough time thinking strategically. A question like this one, during a team meeting, might spark a conversation with your entire group about the things that could be put in place. Collect those ideas and come up with approaches to put them into effect.

Another thing you can do is determine what you can do to increase internal and external client satisfaction. Let me provide an example of each:
  • Internal: I just had a conversation last night with one of my directs. A user had asked a question and it was taking longer than normal to resolve it. The gist of it was that a different Support group should have been handling the query, but somehow it landed in my team's lap. What my team had done was forward the e-mail to the other Support group, and there had been no response. My challenge to the team was that we should take more ownership of issues. It would have been better to call the user to clarify the problem. Instead of sending an e-mail, it would have been better to engage the other Support group directly, over the phone, so that a richer conversation could have happened. This would have been a great opportunity to transfer accountability, reassign tickets, convey urgency, etc.
  • External: In a prior gig, we had many institutional banking customers who connected to our systems to receive prices on financial instruments. If a client was not connected to us, they were also not dealing with us. This means loss of revenue, of course. We went through the logs and found the approximate times that customers normally connected (we didn't have documented SLAs, a problem we inherited). We set up monitoring for each customer and put a threshold on it such that, if they didn't connect within a period of time from when they normally did, an alert showed up in our dashboards (a minimal sketch of this kind of check follows below). This prompted us to call the client and ask them to connect. Many times they didn't know they weren't connected. This small effort increased revenues for the bank and customers really appreciated being notified.
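Here is a minimal sketch of that kind of "expected connection time" check. The customer names, expected times and grace period are hypothetical, and the set of currently connected customers would come from whatever your connectivity layer exposes.

```python
from datetime import datetime, timedelta

# Expected first-connection times learned from the logs (no formal SLAs existed).
EXPECTED = {
    "CustomerA": "07:30",
    "CustomerB": "08:15",
}
GRACE = timedelta(minutes=30)

def overdue_customers(connected, now=None):
    """Return customers who still aren't connected well past their usual time."""
    now = now or datetime.now()
    overdue = []
    for name, hhmm in EXPECTED.items():
        expected_at = now.replace(hour=int(hhmm[:2]), minute=int(hhmm[3:]),
                                  second=0, microsecond=0)
        if name not in connected and now > expected_at + GRACE:
            overdue.append(name)  # surface on the dashboard and call the client
    return overdue

print(overdue_customers(connected={"CustomerA"}))
```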

Even associates can help when it comes to expense reduction. I was at a company where we used a monitoring tool that cost over $1MM in licensing annually. It was quite feature-rich, with a great graphical interface, etc. But as it turned out, we needed something a bit more basic: a simple dashboard that displayed alerts was all we needed. Most of the Support people in my group had Development backgrounds, so we took on a project to build a monitoring tool. A few months later we delivered the tool and were actually able to replace the vended software. We saved that $1MM in expense.

Increasing productivity might also be defined as stopping low-value tasks and doing more productive ones. I was in a Support group where the monitoring was quite noisy. There were tons of alerts and people would clear them out every day. Day in, day out, clear the alert. Repeat. Doing this is low value. Instead, we cleaned up the monitoring. We put together a list of noisy false positives and embarked on a project to clean them up: configuring the tool to ignore some, reclassifying the severity of the alert, removing the alert from the code altogether, etc. The now-quiet monitoring tool enabled us to focus on more value-added tasks, like building automation and putting together tools to help the support effort.

If you are a manager, click here to go to Part 2 of this article.

Monday, September 16, 2013

A Lesson Learned

We had reached a critical point in the meeting. For several weeks now, we'd been focused on defining a laundry list of projects that we were about to embark on. The goal was to standardize Support processes across the organization to achieve greater efficiencies in terms of tracking metrics, managing incidents, monitoring and alerting, etc. You name it, we had it covered. All of the Support processes were to be the same across the organization. Things were going to be much easier for everyone.


Right then, one of the managers declared that she had no interest in doing the work. So, we asked why. Was it a bandwidth concern? Was it a funding problem? Did she not find value in performing the work? The answer to all of these questions was no. So what was it, we asked. Her response: "My manager simply doesn't care whether I do the work or not. She hasn't asked me to do it, so I don't think I really need to."

Someone else chimed in and said something quite similar.

There are several things we can learn from this story (true story, by the way). The first is for us Support managers out there:

Show your teams you care about their work.


Support teams go through a lot of stressful situations. It can be a thankless job. But if you, as a manager, don't take the time to acknowledge your group's efforts, who will? There is nothing that kills momentum and initiative more than managers who don't recognize their groups' efforts. For Support teams, not having engaged managers who recognize the importance of the Support effort can be deadly: people get burnt out, the due diligence in monitoring goes away, they snooze alerts instead of reacting aggressively, or they simply feel too disengaged to work on those initiatives that can really make things better.

As managers, we need to take the time to celebrate our teams' accomplishments. Send that thank-you note or two. Gather the troops around and recognize the person who went the extra mile. A small cheer or clap might be all that's needed to re-energize that team member who used to be great but has fizzled out a bit.

For team members there's something to learn from this story as well:

Do the right thing.


Never stop doing the right thing just because you don't think your manager cares. There's value in Availability metrics (this is a report on how well your apps are doing). Be relentless with Problem Management; this is what makes your applications more stable. Take on those projects that will help make things better all around. Never give up. Your efforts will be recognized. Who knows, perhaps one day you'll have the leadership of the group and you can be a different kind of manager.

In the end, we do what we do not because our manager cares; we do it to enable a business. It takes a special person to wake up in the morning, know you're going to do Support, and still come into the office with a smile on your face. In many respects the terms Application Support or Production Support don't really do justice to what we do.

So, keep the goal in mind and keep driving towards it. Your business will thank you for it and you'll feel much better about those daily achievements that come from Production Support.

Friday, September 13, 2013

The Info You Need When You Need it Most: Runbooks

For this post, I'm going to continue to focus on the knowledge aspect of an application. In particular, I'll talk about Runbooks.


Runbooks should be the first point of reference for anything related to an application. Each and every application you support should have a runbook. Otherwise, it would be like flying an airplane without a manual (for those who didn't catch the reference, every pilot has to use the airplane manual when starting it, no matter how familiar they are with the model).

Runbooks should contain some key information about an application.

The most important section a runbook should contain is a Business Context section which provides the users some idea of the business processes, their criticality and potential financial impact. Most runbooks I've seen don't contain this section, but I like to have this in place. This section should help to further solidify to a group of techies that they don't support some technology or application, but a business instead.

Runbooks should inform the analyst about the Architecture of an application. It should provide an overview of the servers and databases they communicate with. The Architecture section should provide a network context for the application, as well. It should also depict any middleware being used and also provide an idea of other upstream and downstream dependencies.

Another key section for the Runbook is an Administration section. This section should provide the user information about things like how to restart processes, scheduled jobs, breakglass procedures and start/end of day checks.

Likely, the most critical section in a runbook, when it comes to incidents, is a Monitoring and Alerting section. This section of the runbook should provide a list of common alerts and how to resolve them. This section might also contain information about the eyes-on-glass procedures for monitoring the application.

Next in criticality after the Monitoring and Alerting section is the Escalation section. The contact details for Development and key Business users should be documented there. Also, contact information for key Infrastructure teams and Upstream/Downstream teams should be captured.

A section which provides more detail about how the application works would be an Application Deployment section. This section should contain information like which locations an application is deployed in and what dependencies it has.

The Monitoring and Alerting section should be supplemented with a Troubleshooting section which captures the most common issues, known bugs and limitations.

A Tools section in a runbook which contains the common tools the team utilizes for troubleshooting might be a good thing to document as well. New team members would certainly appreciate having a handy list of the tools their teammates use and perhaps links to downloading/installing these tools should be there as well.

A final word about Runbooks. Do you want to assess your team's proficiency when it comes to application knowledge? Make a bulleted list with each section of your runbook. Pick some topics from each section and make a little quiz. You'll now have a quick and dirty way to find out their proficiency level.

Monday, September 9, 2013

Four Key Areas your Support Team Needs Training In

Everyone knows their staff needs training, but have we given any thought to what that training should entail? Let's start with answering the WHAT.
There are four key areas your Production Support staff needs to be trained on:
  1. The Application: You cannot be successful unless your Production Support staff knows the application they're supporting. You can have all the processes in the world, but if your Support guys don't know the application, they won't be able to support it. I've seen very few Production Support teams who have staff training plans, especially for new hires. With the way Production Support teams are budgeted, this is a mistake. Typically the reason you're hiring someone is because you have some urgent need: someone left or you're taking on Support for a new tool. However, if no training plan is in place, it'll take six months to a year, depending on the complexity of your application, before that new joiner is truly productive.
  2. Your Processes: Your team needs to know what they need to do to ensure your processes are being followed. A good portion of the posts on this blog so far have been focused on the necessary Support processes. For example, unless your team members know how to enter a ticket (and why this is necessary), they won't do it correctly. This could mean that your metric tracking will be off, as perhaps you won't have a record of key issues or all Support effort. Not following correct change management process could mean a botched release, or even worse, some really uncomfortable meetings with auditors.
  3. Your Releases: Just because someone knows the application doesn't mean they know all about the new features being pumped into it with every release. You can consider initial Application training to be a snapshot of the application at a point in time. But keep in mind that the Application will continue to evolve.
  4. Your Stakeholders: Again, we Support a business. Support staff need to know who their business is, who the key players are and how it's organized. What things are urgent to the business should also be covered.

Now, let me provide you some ideas into HOW you can train your staff.

In the simplest case, put together a list of topics starting with an architectural overview of the application, moving into the key areas of the app and concluding with who the business is (and how they're organized). Use that as a template to put together a PowerPoint deck that will cover key highlights of each topic and have one of your staff members provide an overview to any new staff arriving.

In order to make the process more efficient, as you cover the slides, do it in a tool like Webex. Use Webex to record the session and the information being presented. Now, you'll be able to distribute that out to new staff without the need for a presenter.

For your processes, you can do more scenario-based type training. For example, you could use some of your previous incidents to come up with a scenario to train people in your Incident Management process. You can try to mimic the situation and allow the person being trained to explain what they would do to resolve the issue. It's important that this be detailed enough to determine gaps and provide suggestions for improvement.

Finally, for releases, it's important that your team has a forum with the Development team, for at least an hour, if not more. At a basic level you can go through the release notes and have a Q&A session. Ideally, however, your Development staff is coming up with more polished training decks that they want to cover with you for every release. Find a way to get buy-in from your Development team to ensure that if there is no training they will have to help you support any issues related to new features. At least have them understand that you will fire call them (in the middle of the night, if needed) if they don't train you.

In any case, never underestimate the need to train your staff. There is nothing more discouraging to a Support team than having someone who cannot contribute to the Support effort. There's also nothing more frustrating for a Support analyst than not being able to help. So, you have to give all your team members a fighting chance at success.

Friday, September 6, 2013

Be a Part of the Solution: Be a Hero

Remember the purpose of Production Support? There are several things that we should consider doing, as individuals, in order to accomplish our purpose as a Production Support team:
  1. Being flexible. I once knew a Production Support manager whose contention was that Production Support teams are "gatekeepers" of Production. This is a true statement. What wasn't true was his contention that any release that went into Production had to follow the Production Support procedures to a tee and that, if the Development team didn't follow them, the Prod Support team should push back and delay the release. Does this meet the goal? The answer is no. Though we all want a perfect world, sometimes we need to be flexible in order to get to the goal. If the business needs a new feature urgently, to take advantage of a market opportunity, is it reasonable to expect them not to make money because Production Support didn't have all the documentation checked off? I don't think so. As with many things, finding the right balance between due diligence and meeting business objectives is an art.
  2. Raising our hands. In all companies there are two types of people: those who do and those who don't. You've seen the ones who don't. They have an opinion about everything in meetings. They're experts at letting you know what shouldn't be done rather than what should be done. They're great at telling you that your approach won't work, despite evidence to the contrary. But when it comes time to do the actual work, they vanish. I always encourage Production Support team members to be the doers. All that stuff that no one wants to do, but which is important, should be something we're willing to do if it helps us meet our purpose as a Production Support team: do that break-fix even though your stated SLA says you only do Level 2 Support, manage that special project which really should be done by a different group, answer the call when someone asks in a meeting who will do the work and you hear silence. Remember not to argue over who's going to do it, as that sometimes takes more time, energy and money than actually doing the work.
  3. Doing the right thing. Many times it's difficult to do the right thing, especially when the number of issues has been high and you feel mentally and physically fatigued. At times, some of the most rewarding work can come when you push yourself a little and you do those things everyone hates doing. For example: clean up that monitoring system and find a way to remove that false-positive alert instead of just clearing it; create that ticket in the system even though the issue is already fixed; run that report and evaluate trends to make sure that the system won't run into issues. The more disciplined you are at doing those little things everyone hates doing, the easier the work will become for you and your teammates.
So be daring. Be a part of the solution, not the problem. Enable your business. In short, don't be afraid to be a hero.

Wednesday, September 4, 2013

Follow The Sun Success: Handovers

According to Wikipedia, Follow-the-Sun is a type of global workflow in which tasks are passed around daily between work sites that are many time zones apart. The idea behind follow-the-sun is that work never stops.

Though evidence suggests that Follow-the-Sun Software Development doesn't work, this is not the case for Production Support. Follow-the-Sun can and does work, if Prod Support teams are willing to implement a few best practices to enable collaboration.

In the shops where I've worked, the most common setup is to cover 12 hours out of AMRS and 12 hours from India. Typically each region has an early (8:00 AM - 5:00 PM) and a late (11:00 AM - 8:00 PM) shift. The shifts are modified for Daylight Saving Time adjustments. Another common setup is to cover 8 hours out of AMRS, 8 hours out of APAC and 8 hours out of EMEA (with some overlap between the shifts).

The most important process that teams need to implement, is likely the Handover process. During the handover, the accountability for the work transitions from one region to the next. There are two items that make handovers successful, in my experience:
  1. Handover E-mails: Handover e-mails should contain information about open Incidents and Service Requests. They should also contain a recap of significant issues that occurred during the shift, for example Major Incidents or preventive restarts of running processes. Handover e-mails ensure that a snapshot of the work being handed off is captured, which provides better insight into accountability.
  2. Handover Calls: Handover calls should be short (about 30 minutes) and should be used to cover the open tickets being handed off. It is a forum to allow for clarification of what needs to be done. It is NOT a forum for working on the tickets. Handover calls should start and end on time and should have enough representation for each ticket being handed off. During the handover call, tickets should be reassigned to people on the upcoming shift. At no point should a ticket remain assigned to people on the shift handing off, as accountability is lost and work on it won't continue until the following day.

Another best practice that enables follow-the-sun success for Prod Support is that of implementing Start and End of Day Healthchecks. At the beginning of the week, a start-of-day check should be carried out to ensure systems are ready to perform business transactions. Then, at the end of each shift, the team receiving the system should carry out health checks to ensure that everything will run smoothly during their time. I find that doing it this way works better (than the team handing over doing them) for two reasons: 1) The team just starting is freshly rested (and isn't ready to run out the door) and 2) The team just starting will have the accountability for any issues (thus they are more invested in things not going wrong).

Start and End of Day Healthchecks should be documented. A summary of the checks performed and the results (perhaps with a Red/Amber/Green status) should be sent out to interested stakeholders to ensure that everyone is aware the system is ready. It's worth noting that healthchecks can be automated. If they are, the team receiving the handover would be responsible for correcting any anomalies the healthchecks might reveal.
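If you do automate them, the healthchecks can be as simple as a script that runs a handful of checks and prints a Red/Amber/Green summary for the handover report. Here is a minimal sketch; the individual checks (a process name, a disk threshold) are placeholders for whatever your application actually requires.

```python
import shutil
import subprocess

def check_process(name):
    # pgrep returns a non-zero exit code when no matching process is found.
    running = subprocess.run(["pgrep", "-f", name], capture_output=True).returncode == 0
    return ("GREEN" if running else "RED", f"process {name}")

def check_disk(path="/", amber_pct=80, red_pct=90):
    usage = shutil.disk_usage(path)
    pct = usage.used * 100 // usage.total
    status = "GREEN" if pct < amber_pct else ("AMBER" if pct < red_pct else "RED")
    return (status, f"disk {path} at {pct}%")

if __name__ == "__main__":
    # The printed summary can be pasted into (or mailed as) the start-of-shift report.
    for status, detail in [check_process("myapp_server"), check_disk("/")]:
        print(f"[{status}] {detail}")
```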
Following these simple guidelines is easy and will work wonders in ensuring your Follow-the-Sun success!

Saturday, August 31, 2013

Get Some Sleep: Problem Management

If you've been in Production Support for any length of time you've had one of those weeks: where everything seems to fall apart, all at once, and you get woken up in the middle of the night to troubleshoot what seem to be never ending issues. It happens to all of us, including companies like Amazon.com, Outlook.com and, yes, even Google.com (check this link to see what I mean).


Outages will happen and effectively dealing with them requires a strong Major Incident Management discipline. The work isn't done, even if service is restored. It's time for Problem Management!

Problem Management is the discipline that helps prevent outages from happening again (and which will allow you to get on the road towards higher system availability). There are 3 key components to Problem Management:
  • Root-Cause Identification
  • Identification of Follow-up Actions
  • Tracking Follow-up Actions to completion

Every single incident has a root-cause. I repeat, every single incident has a root-cause. You might have a difficult time determining the root-cause, but that doesn't mean there isn't one. And in most cases, the issue will recur if steps aren't taken to correct the situation.

I've seen many a Production Support analyst do this: every time an issue shows up (especially minor, recurring ones), they manually intervene, correct the problem and sit back. There is absolutely no value in doing this! It means the issue will come back, with the risk of impacting users. Not to mention that it's just a bunch of needless work. I don't understand the rationale behind this approach, but in case people want to be depended on, it doesn't ensure job security either. If anything, it's a good way to get walked out the door the day stuff hits the fan and your manager finds out you've known about the problem for months. I once found a situation where a manual process had been performed, for 12 years, to handle a particular situation!

For every single root-cause, there is one or more actions that can be taken to prevent the issue from happening again. Each of these follow-up actions needs to be tracked to completion. At most places where I've worked, each of these is put into a Problem Ticket. The follow-up actions might involve the intervention of other teams, for example, Development groups. So it's critical that the Problem Management process is adopted by the entire organization and that all stakeholders are ready to work on the tickets that get assigned to them.

Follow-up actions are typically discussed, created and assigned during a post-incident review. This can be an informal process if the incident is small, but can be a very formal, high-profile meeting for high-severity, critical outages. I'll write more about how to run an effective post-incident review meeting in a later post.

There will be organizations where not all stakeholders are on board with being assigned tickets and tracking them to completion with the right level of urgency. In these cases, I've had success by getting a few key people from those groups to attend a recurring meeting. I've found weekly meetings to be most effective, but every two weeks or even monthly might do the trick. During the meeting, the Problem Tickets are presented with a focus on the highest priority ones. The general idea is to determine what can be done to implement the follow-up action and when it can be prioritized to be worked on. This is a balancing act, of course. But being relentless, and the due diligence of organizing the tickets and presenting them, inevitably lends credibility to the process. External teams are more willing to work on implementing the solutions if they see you're committed as well. I was once at an organization where we had about 300 follow-up action tickets when I first started there. In about a year, that number had dropped to about 20 outstanding items, including new problem tickets that were being opened.

One more thing about Problem Tickets. Incidents shouldn't be the only source for creating Problem Tickets. Recurring issues, enhancements, and automation opportunities can also be tracked in Problem Tickets. They might be prioritized differently, of course, but the general idea is the same.

So start working on your Problem Management discipline. You'll have great system availability to show for it - not to mention, you'll be able to sleep, finally.

Thursday, August 29, 2013

Who you gonna call? CRM for Production Support Teams

Many Production Support teams don't consider Customer Relationship Management (CRM) an important part of what they do. In my opinion, there are very few things that matter more than knowing your customers and their needs intimately. I'll provide a few examples that will lend insight into why I feel this way. Before I do that, let me explain that I use the word customer interchangeably with stakeholder. Thus, customers can be internal business partners, external business partners, vendors, other internal teams, etc.


First of all, there are basic needs that all Prod Support teams have in terms of knowing which teams to contact when resolving an issue goes beyond the Prod Support team's scope. For example, knowing how to contact a Network or DBA team can be critical. Thus, Support teams need to have a well-known, easy-to-navigate Escalation Database. Most organizations I've been in utilize intranet Wikis or document management sites like Sharepoint to keep an escalation list. It's critical that these DBs be kept up to date and contain key information like:
  1. The documented team name: I can't stress this one enough. Many large companies have several teams that handle networks, for example, and knowing which team to call is critical. The team that handles routing might not be the same team that handles load balancing, yet they both might land under the network umbrella.
  2. The escalation rotation with the hours that the rotation encompasses
  3. The contact telephone number and/or e-mail
Each person in the support team must know exactly where to find the escalation database and accessing this information should be easy. There are few things more important, when you get a new person in your Support team, than providing them the information of whom they should call.

A few more thoughts on the Escalation DB: it should also contain the names of all the support staff and managers. There should also be a process to maintain the Escalation DB (most teams will update the information organically; however, giving the Escalation DB a quarterly or semi-annual review is always healthy).

More advanced CRM practices can provide Support teams valuable information about clients, especially external customers. Utilizing CRM tools like Salesforce.com or Microsoft Dynamics allows teams to track various pieces of information about customers. For example, different customers might connect to a system at different times to perform transactions. It's also good to know which products a given customer uses versus others.

The true power of CRM systems (which can provide competitive advantage) comes when they're accessible to both business users and support staff. Well-documented and adopted processes to gather information about a customer and their transacting patterns allow teams to better prepare for and support them. For example, when onboarding a new customer, a sales representative for a financial trading system might mention that a particular customer performs most transactions right before market close (let's say between 3:00 PM - 4:00 PM). The Production Support team might then add extra monitoring to keep an extra eye out for this customer and their transactions during this window. The team might notice that the customer isn't connected and ready to do business and might want to preemptively call to make sure that everything's OK. In many financial trading systems, revenues come in the form of fees on transactions; if the customer can't do business, this could mean lost revenue.

Thursday, August 22, 2013

Time Track to Identify Hotspots

I haven't met a support analyst yet who likes to enter tickets into tracking systems. This practice, however, is a necessity for support teams. Consider it a necessary evil, if you will. In fact, all support staff need to understand that ticketing is just part of their responsibilities. That doesn't mean we shouldn't explain to analysts why ticketing is necessary. With this post, I plan to explain the need for ticketing.


In my mind, time tracking's biggest benefit is helping to identify support hotspots.

Time tracking should tell us things like:
  1. How much time are teams spending on incidents?
  2. How much time are teams spending on service requests?
  3. How much time are teams spending on change requests?
  4. Which applications utilize most of the support bandwidth?
  5. Which application components are the biggest offenders?
  6. Are there recurring issues which should be corrected?
  7. Are there recurring service requests which should be automated (e.g. reports)?

Deep diving into the time tracking data to answer those questions will provide insight into which areas should be the focus of stability programs. Deep dive sessions, where tickets are reviewed for trends and correlated with availability numbers, should be scheduled to occur regularly (I suggest at least quarterly). Once the various hotspots are identified, problem tickets should be opened for follow-up actions (more on Problem Management in my next post).
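
For teams that can export their tickets, the hotspot analysis doesn't need to be fancy. Here's a minimal sketch (the ticket fields, application names and hours are made up) that just sums effort by ticket type and by application:

    # Minimal hotspot analysis over exported tickets (fields and data invented).
    from collections import Counter

    tickets = [
        {"type": "incident",        "application": "OrderRouter",  "hours": 6.0},
        {"type": "incident",        "application": "OrderRouter",  "hours": 3.5},
        {"type": "service_request", "application": "ReportEngine", "hours": 2.0},
        {"type": "service_request", "application": "ReportEngine", "hours": 2.0},
        {"type": "change_request",  "application": "PricingFeed",  "hours": 1.0},
    ]

    hours_by_type = Counter()
    hours_by_app = Counter()
    for t in tickets:
        hours_by_type[t["type"]] += t["hours"]
        hours_by_app[t["application"]] += t["hours"]

    print("Effort by ticket type:", hours_by_type.most_common())
    print("Effort by application:", hours_by_app.most_common())
    # Recurring service requests against one application (ReportEngine here) are
    # automation candidates; recurring incidents become problem tickets.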

Time tracking is also important for metrics and reporting purposes. In fact, time tracking is the basis for metrics and reporting. We can't report unless we're keeping track of issues and how they're affecting system availability. Important things to report on are:
  1. Availability metrics
  2. Top Talker incidents: These are the incidents which caused the most impact to availability
  3. Incident counts
  4. Service request counts
  5. Change request counts
  6. Support effort expended
Why do we even care about metrics and reporting? We care because we can't fix it if we don't know it's broken. Support teams can't work on "hunches" and need to focus their attention on true problems. If you can't measure it, you really can't fix it.

Finally, let's not be naive. Support is a cost center, and many senior managers are interested in making sure that support teams are staffed only with the necessary folks. Why spend the extra dollars if we can't justify the expense? So at least one point of time tracking is to justify staffing needs. For support managers, it also helps forecast future staffing needs as the application changes (which can mean an increase or decrease in staff).

Friday, August 16, 2013

Level with Me: The Levels of Production Support

I'd like to take a pause and answer a question: What are the levels of Production Support?


This is a question I see a lot in forums all over the Web, regarding what the different Support levels mean. Of course, this is already a loaded question because it assumes all companies have different levels of Prod Support. I can tell you, from experience, that this isn't always the case. Some companies' Support staff answer the phone, perform investigations on questions, and also perform break-fixes (which would be a combination of the typical levels of Support).

But the question that most people mean to ask is, "In a 3-Level Support Model, what does each level mean?" This question is good and relevant because many, many organizations structure their Support teams this way.

So without further ado, here are the 3 Levels of Production Support...and what I notice they typically mean in various organizations:
  1. Level 1: This is the most basic of Support levels. Staff performing this level of support generally perform eyes-on-glass monitoring. In the event of an alert, Level 1 staff rely on Standard Operating Procedures (SOPs) that they follow in order to correct the situation. The SOP might call for staff to run a script, restart a process, etc. Level 1 staff also field business requests and prioritize tickets. If Level 1 staff don't have an SOP to resolve the question or request, they must escalate to staff performing Level 2 Support. At this level, no in-depth troubleshooting is done; only enough troubleshooting to define the issue at hand is undertaken.
  2. Level 2: This level of Support generally involves more in-depth troubleshooting and requires that the staff understand the way a system works quite intimately. Level 2 Support includes finding resolution to incidents for which SOPs are not available and finding answers to questions which involve intimate knowledge of the system. This might include poring through log files to determine why something went wrong or why a system transaction behaved in an unexpected way. Level 2 Staff might utilize the application code to resolve an issue. They're also in charge of handling service requests that involve creating custom queries to automate reports. When Level 2 staff cannot handle a particular request, they must escalate to Level 3 Support.
  3. Level 3: Level 3 Support involves an understanding of an application deep enough that break-fixes can be done by staff at this level. Ideally, Support staff can build this level of knowledge themselves; in many organizations, however, Level 3 Support is performed by the application Development team.

Thursday, August 15, 2013

Capacity Management Shouldn't Be An Afterthought

Capacity management, or capacity planning, is the often forgotten process that can make a huge difference in terms of application availability. When applications first come online, both the hardware and the software are sized to handle the volumes of business at that point in time. However, transaction volumes never stay the same and typically grow at faster rates than anticipated. The reasons for this vary, but generally, new applications enable business efficiencies that help the business grow. It's also very hard to anticipate business growth and external factors such as regulatory changes. All these changes can increase the demand for system resources, and Production Support teams need to be aware of how this growth in volume can affect application performance.


Capacity Management attempts to avoid reactive approaches to sizing systems (usually done at the point of impact) and instead looks to cost-effectively prevent system degradation due to volume increases.

At a basic level, Capacity Management programs need to track basic system Key Performance Indicators (KPIs) such as CPU, Memory, Disk, Swap and Network utilization. The real power of capacity planning comes when transaction volumes can be correlated with the individual system KPIs. This enables teams to answer critical questions such as: If volume increases by x%, how well will my system perform? If your capacity management program doesn't answer that question with a good degree of certainty, then you don't have much of a program at all.

Beyond the basic KPIs, Production Support teams need to identify system-specific metrics that can give them insight into the performance of the tools they manage. For example, I used to run a group in charge of supporting FX platforms. A key indicator for FX is how quickly a price can be generated once markets tick. So pricing latency was a critical KPI for that system.

Capacity metrics need to be tracked and reported on at least a monthly basis. A thorough analysis needs to be done to determine how transaction volumes are affecting KPIs, and a forecast analysis should be done to determine how upcoming volumes will impact performance. This can be done with simple regression analysis, where you look to determine how your transaction volumes correlate with your KPI numbers. You can find many articles on the Internet about how to perform regression analysis with tools like Excel. Some capacity planning tools do the forecasting part as a built-in feature.

Regression analysis will provide two values (amongst others) which are critical for Prod Support teams:
  1. R-squared: This tells you how much of the variation in a KPI is explained by transaction volume. Real-world KPIs like CPU percentage are driven by many factors besides volume, so don't be surprised by modest R-squared values. The higher this value, though, the better.
  2. Regression coefficient (slope): This is the key number. It tells you how much a particular KPI will change when your transaction volume goes up by a given amount, which is exactly what you need for forecasting (see the sketch below).
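
If you don't have a capacity tool that does this for you, a simple sketch of the regression in Python (using made-up daily volume and CPU numbers purely for illustration) looks like this:

    # A simple regression sketch correlating transaction volume with a KPI
    # (the daily volume and CPU numbers below are made up).
    volumes = [100_000, 120_000, 150_000, 180_000, 210_000, 260_000]  # txns/day
    cpu_pct = [22.0, 26.0, 31.0, 38.0, 43.0, 52.0]                    # average CPU %

    n = len(volumes)
    mean_x = sum(volumes) / n
    mean_y = sum(cpu_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in volumes)
    syy = sum((y - mean_y) ** 2 for y in cpu_pct)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(volumes, cpu_pct))

    slope = sxy / sxx                     # KPI change per extra daily transaction
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy)  # variation in the KPI explained by volume

    print(f"slope={slope:.6f} CPU% per txn/day, r_squared={r_squared:.3f}")

    # Forecast the KPI at 50% more volume than the latest observation.
    forecast = intercept + slope * (volumes[-1] * 1.5)
    print(f"Forecast CPU at +50% volume: {forecast:.1f}%")

The forecast line is where this pays off: it gives you a defensible answer to "if volume increases by x%, how will my system perform?" rather than a hunch.
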
The KPIs also need to be compared to predetermined Service Level Agreements (SLAs). To continue with the pricing example above, knowing how quickly we should be generating a price is critical. Let's say the target is 5 milliseconds and it's currently taking us 1 millisecond to generate a price. Our forecast shows that increasing volumes by 50% will degrade pricing performance to 4 milliseconds. We know, then, that it would be OK to increase volume by 50%, as we'd still stay within the SLA.

My capacity planning tool of choice is TeamQuest, which is actually quite nice. I'm keen to hear from other folks which other tools they've used and what their experience with them is.

Wednesday, August 14, 2013

Who Moved My (Application) Cheese: Change Management

Change…It happens. Not only does it happen, it needs to happen. The world moves quickly and applications need to keep up. Markets demand new features in software just for businesses to stay competitive. Change also enables businesses to find “new cheese” and take advantage of opportunities. Application support teams are uniquely positioned to enable the business by effectively managing change to Production environments.


Change Management is a process, but it is also an art. Taking a stand when changes would cause undue risk needs to be balanced with the need for new features and bug fixes. Having a solid, healthy partnership between Support and Development always helps keep that balance.

Most companies have ways to track change, be it via change request tickets or simply via a spreadsheet. Obviously, the former is better than the latter, but the availability of change management tools also depends on the size of the organization.

It’s critical that change always be tracked, no matter how small. Attentiveness to this detail is essential; otherwise, change management will not be effective. Most change requests will contain at least the following key fields (besides the obvious, such as a change description):
  1. A Risk assessment
  2. Timing of the Change
  3. Implementation Plan
  4. Backout Plan
  5. Validation Plan
  6. Responsible Party
It’s healthy for teams to know all of this, as it becomes critical information when faced with a change-related incident (see the illustrative record below).
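
As an illustration only (the fields mirror the list above; the values are invented), a change record captured in a structured form might look something like this:

    # An illustrative change record mirroring the fields above (values invented).
    change_request = {
        "id": "CHG-1234",
        "description": "Increase JVM heap on OrderRouter from 4 GB to 6 GB",
        "risk": "Low",                     # risk assessment
        "scheduled_start": "2013-08-17 22:00",
        "scheduled_end":   "2013-08-17 23:00",
        "implementation_plan": "Update JAVA_OPTS in orderrouter.conf; rolling restart",
        "backout_plan": "Restore previous orderrouter.conf; rolling restart",
        "validation_plan": "Smoke test order entry; confirm heap on the dashboard",
        "responsible_party": "J. Smith (Dev)",
        "approvals": ["Prod Support change manager", "Dev team lead"],
    }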

Besides the obvious (and sometimes tedious) task of tracking changes, other practices also provide benefits in helping manage change.

For example, I recommend that Prod Support teams designate a change manager who runs a weekly meeting to review changes for the upcoming week. The agenda should be a walkthrough of the list of change tickets for the week. Each item should be represented by a Development resource who can speak to the need, risk and urgency of the change. This allows the group to ask questions about changes as well as propose alternative timings and approaches.

Also, many large organizations have change meetings at a larger scale (in many organizations, they’re known as Change Advisory Boards) for higher-risk changes. The change manager should represent the entire Prod Support team in the CAB meetings and be prepared to discuss the upcoming changes.

Many times changes are classified into various categories with differing lead times. For example, a “Normal” change might have a 5 day lead time, whereas an “Emergency” change might have shorter lead times. In general, the shorter the lead time, the higher the risk and thus the necessity for more levels of approval. Many organizations allow certain changes to go in without supervisor pre-approval, especially during Incident/Break-Fix scenarios.

There is one change type that many organizations don’t have, but that I highly recommend. I call them Business as Usual (BAU) changes. BAU changes allow teams to make small, low-risk changes (e.g. changing an IP address in a configuration file) without the need for much risk assessment or many levels of approval. Having this lightweight change type keeps teams nimble. The change should still have a supervisor’s approval, but in general this should be easy to obtain. BAU changes should never involve releasing code (even small SQL changes to stored procedures), if anything to prevent misuse.

Another best practice I recommend is for Prod Support teams to always have some level of approval over changes to the environments they manage. This makes Prod Support the gatekeeper and ensures due diligence has been done before a change is made.

Finally, Prod Support should also participate in the validation of change correctness.

Monday, August 12, 2013

The Jenga Stack: Application Monitoring

One of the most critical responsibilities that an application support team has is that of Application Health Management (see my previous entry The 6 Managements of Prod Support), also known as Monitoring and Alerting. From experience, I believe monitoring application health has 3 layers (plus a few additional areas that need to be monitored, lest availability be impacted).


These three layers work much like a Jenga stack. As individual components or blocks start coming out of service, system stability degrades until eventually the stack comes crashing down. Application Support should know when individual components have been impacted and should be able to take proactive steps to bring them back to 100% service.

The three layers I’m talking about are:
  1. Machine health: total CPU, total memory, total disk, total swap, network interfaces, etc.
  2. Basic application monitoring: running processes being up, process memory utilization, basic error and exception checking, smoke tests, etc.
  3. Business process monitoring: transactions occurring correctly, transactions completing within performance SLAs, transaction acknowledgements, transaction persistence in the database, etc. For financial applications: market data feeds, user sessions, pricing, etc.
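
As a toy sketch of what a check at each layer might look like (the thresholds, process names and counts below are all invented), the point is simply that every layer gets its own explicit check rather than waiting for someone to notice a symptom:

    # Toy checks for each monitoring layer (thresholds, names and values invented).
    import shutil

    def check_machine_health():
        alerts = []
        disk = shutil.disk_usage("/")
        if disk.used / disk.total > 0.90:
            alerts.append("RED: root filesystem above 90% used")
        return alerts

    def check_application(expected_processes, running_processes):
        # In real life the running set would come from the OS or a monitoring agent.
        return [f"RED: process {p} is not running"
                for p in expected_processes - running_processes]

    def check_business_process(orders_sent, acks_received):
        # Business-level check: every transaction should be acknowledged.
        if acks_received < orders_sent:
            return [f"AMBER: {orders_sent - acks_received} orders missing acknowledgements"]
        return []

    alerts = (check_machine_health()
              + check_application({"orderrouter", "pricingfeed"}, {"orderrouter"})
              + check_business_process(orders_sent=1500, acks_received=1498))
    print(alerts or ["GREEN: all checks OK"])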

The other pieces that I didn’t include in these 3 are middleware components. The reason I don’t bundle them with the application is twofold. For one, middleware components such as application servers, databases and queuing systems are not really part of the application itself. Secondly, in most medium and large organizations, the monitoring of system health for shared infrastructure is managed by a separate team.

How to Check your Jenga Stack


Let me make one more comment, on how monitoring should be done.

In terms of monitoring, it’s best to use enterprise tools such as CA APM or Geneos (ITRS). I’m familiar with these tools, and I can recommend them to teams looking for monitoring tools. Although I don’t have first-hand experience with Nagios, I hear a lot of good feedback on the tool (and it's mostly free).

I see many teams use tools like the ones mentioned above to capture issues, only to then turn around and send e-mail alerts. I strongly advise against this approach. E-mail quickly becomes unmanageable, especially if alerts and business requests are flying into the same mailbox. I’ve seen way too many missed alerts with this approach, and really, it’s unfair to ask that operators not miss anything.

I once saw a group that had about 2,500 e-mails coming into their team mailbox daily. The mailbox contained user requests, team responses, valid alerts and false-positive alerts. No wonder they couldn’t keep up with it! A better approach to monitoring is to utilize graphical dashboard views. Dashboards consolidate an entire system into a bird’s-eye view and put the entire health of the system at the operator’s fingertips. Colors should remain simple in dashboards: red for critical alerts, amber for warnings and green for OK. For systems that are very time-sensitive, using just two colors, red and green, is also OK (as warning thresholds sometimes start getting ignored). ITRS and Introscope provide dashboarding capability (and so does Nagios).

Dashboards should start with the end in mind. For example, I always like for teams to start with system architecture diagrams. Once they’ve drawn the diagram in the tool, they work backwards to determine which alert will give them the necessary information about the various components they’re interested in monitoring.

Another suggestion I’ll make is that Prod Support teams should continuously and organically work on cleaning up false positives. False positives make teams ineffective: the team can’t tell the good alerts from the bad, and eventually everything starts getting ignored.

Thursday, August 8, 2013

The Uptime Carve Knife: Calculating Availability

As mentioned in my previous post The Purpose of Production Support, availability is what Production Support is all about. Availability is always expressed as a percentage of uptime. Uptime is the pre-determined amount of time that an application is expected to be available to conduct business, and it's generally expressed in minutes. For example, if an application runs 24x7, the total uptime would be 24 hours * 7 days = 168 hours, which expressed in minutes is 168 * 60 = 10080 minutes. Achieving 100% availability means the application was available for the entire uptime period.


From stage left now comes Mr. Outage to rob your app of precious uptime minutes. He’ll pull out his sharp knife and carve out a slice of uptime. Keep in mind that any issue that has business-level impact qualifies for lost uptime minutes. Let’s say the outage lasts 1 hour, or 60 minutes (which, of course, would never happen to your team, since you’re applying the information in this Blog). This means that your application was available for only 10080 - 60 = 10020 minutes. That works out to 10020/10080 = 99.40%, which is your availability.

The example above assumes that there was a full outage, where no transactions could be made in the application. Availability, however, need not be an all-or-nothing proposition. Suppose your application services several geographic regions with varying numbers of business users. Let’s say the breakdown is as follows: 50% AMRS, 40% EMEA and 10% APAC. What happens to our availability number if the outage carving knife is pulled out only during a time which impacts the APAC region? Now our availability impact is really only 10% of the total, and the outage minutes can be adjusted. The adjusted minutes, following the scenario above, would now be 60 * 0.10 = 6. The availability number now looks much better: 10080 – 6 = 10074 minutes, which gives 10074/10080 = 99.94%. Voila! We’ve achieved 3-9’s!

A Prod Support team needs to determine which services an application provides and what percentage of the application's functionality is impacted when a particular service is out. The percentage of the user base affected also needs to be taken into account. Let me continue with the example above.

Suppose, now, that the outage was caused by a data feed not arriving on time. Let's say that because of this missing feed, users can still perform 80% of their activities, but not the remaining 20%. Again, the outage is happening in APAC, so our adjusted impact is 6 minutes. Of those 6, the users could still do about 80% of their work, so we adjust again and only account for the 20% worth of impact (6 * 0.20 = 1.2 minutes). Now our availability number looks like: 10080 – 1.2 = 10078.8, which equals 99.99%. We’re a 4-9’s shop!
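
Here's the same arithmetic as a small sketch, so the weighting logic is explicit (the regional split and impact percentages are the ones from the example above):

    # The adjusted-availability arithmetic from the example above.
    UPTIME_MINUTES = 24 * 7 * 60            # 10080 minutes for a 24x7 application

    def adjusted_availability(outage_minutes, user_base_affected, functionality_lost):
        """Availability % after weighting an outage by who and what was affected."""
        impact = outage_minutes * user_base_affected * functionality_lost
        return (UPTIME_MINUTES - impact) / UPTIME_MINUTES * 100

    # Full outage for everyone, 60 minutes:
    print(round(adjusted_availability(60, 1.0, 1.0), 2))    # 99.4
    # Same outage, but only APAC (10% of users) affected:
    print(round(adjusted_availability(60, 0.10, 1.0), 2))   # 99.94
    # Only APAC affected, and only 20% of their functionality lost:
    print(round(adjusted_availability(60, 0.10, 0.20), 2))  # 99.99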

What if you have multiple applications? How do we account for them as a whole? Most folks assume that a simple average is the way to combine availability numbers, but percentages should really be combined against a common base. So, how do we do it?

Suppose you have 6 applications with availabilities of 57%, 63%, 56%, 49%, 65% and 78%.

Think of each application as having 100 availability “points” to earn, or 600 in total (which would mean 100% availability). Adding up the points actually earned gives 368, and 368 out of 600 = 61.33%. This is your total availability.

In this example the simple average happens to give the same 61.33%, because every application carries equal weight. The two numbers diverge as soon as applications have different uptime windows: an application that runs 24x7 should count for more of the total than one that only runs during business hours, which is why the aggregation is best done in uptime minutes rather than raw percentages.
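
A small sketch of that aggregation, using uptime minutes as the weight (the percentages are from the example above; the uptime windows are made up to show where the weighting starts to matter):

    # Aggregate availability across applications, weighted by uptime minutes.
    # Percentages are from the example above; the uptime windows are invented.
    apps = [
        # (availability %, weekly uptime minutes)
        (57, 10080), (63, 10080), (56, 10080),   # 24x7 applications
        (49,  3000), (65,  3000), (78,  3000),   # business-hours-only applications
    ]

    total_uptime = sum(minutes for _, minutes in apps)
    achieved = sum(pct / 100 * minutes for pct, minutes in apps)
    print(f"Fleet availability: {achieved / total_uptime * 100:.2f}%")

    # With equal uptime windows this collapses to the simple average (61.33%);
    # with mixed windows, as above, the 24x7 applications dominate the result.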

Wednesday, August 7, 2013

Mayhem is Everywhere: Incident Management

Do you hear that sound? That's the sound of your application coming to a screeching halt because the "new guy" (should we call him Mayhem?) just decided to install a new test server. It just so happens that the server is configured with a duplicate IP address, bringing an entire subnet (and about 40 servers) down. *Any similarities to real events are not intended.


So what now? No insurance company (not even Allstate) can protect you from this type of mayhem, so you're on your own, Prod Support guy! But you're prepared, of course, because you worked very diligently to define an Incident Management Process.

What does a good Incident Management Process look like? A good process handles two key pieces: service restoration and communications. The former is likely more intuitive than the latter, but both are critical. Communications give your stakeholders confidence that service is being restored, and some insight into why it might be difficult to recover from certain situations.

When an incident is first detected, the very first thing that Production Support groups must do is to acknowledge that there is an incident and begin communicating with stakeholders: business partners, technology partners and external clients, if necessary. An initial acknowledgement (email) should be relatively easy to generate as it doesn't really require much detail. It's basically just a note that provides confidence to stakeholders that someone is looking into the issue with the appropriate level of urgency. Another important part of this initial step is to open a ticket (or add some kind of entry) in a tracking system, to ensure that we'll be able to keep metrics about the incident. A good best practice is to put the ticket ID in the initial acknowledgement e-mail and in follow-up communications.

Once the acknowledgement is out the door (or in parallel) the incident triage and troubleshooting begins. In most shops where I've been, this is done via an incident conference call (or bridge). In many cases, the bridge convenes even before the acknowledgement is out the door. At this stage the severity of the incident is assessed and the right resources are called to help resolve the situation. One useful thing to do is to add all participants to a group chat so that it's easier to share logs, server names and other information that might be difficult to communicate via phone.

Regular updates on the results of the investigation and the resolution steps should be sent out to stakeholders. I recommend that these go out every 15 minutes - however, I've seen some shops that do this on an as-needed or even an hourly basis.

Email updates should follow a standard template. When designing the template, keep in mind that many users read the updates on handheld devices (which might not render complicated tables or graphics-heavy HTML).
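
A bare-bones sketch of what such a template could look like (every field and value here is illustrative, not a standard) that renders fine on a phone:

    Subject: [SEV2][INC0012345] OrderRouter - order entry degraded - Update #3
    Status:        Investigating / Mitigating / Resolved
    Impact:        APAC clients unable to submit orders since 14:05 HKT
    Current step:  Failing over to the secondary order gateway
    Next update:   15:00 HKT (or sooner if the status changes)
    Bridge:        +1-555-0123, code 987654

Plain labelled lines like these are easy to scan on a small screen, and keeping the ticket ID in the subject ties every update back to the tracking system.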

Once the issue is resolved, a clear "Service Restored" notice should go out to the stakeholders to let them know that the system is available to them once again.

We're not done yet! Once the service is restored, an assessment of availability should be performed to determine how much the incident has made us deviate from the 100% availability mark. Also, a root-cause investigation should be performed and the whole incident should be summarized in a formal Post-Mortem or Executive Summary. I'll cover more on calculating availability impact and on executive summaries in another post.

At this point, the Problem Management process begins.