
Thursday, August 8, 2013

The Uptime Carve Knife: Calculating Availability

As mentioned in my previous post, The Purpose of Production Support, availability is what Production Support is all about. Availability is always expressed as a percentage of uptime. Uptime is the pre-determined period during which an application is expected to be available to conduct business, and it is generally expressed in minutes. For example, if an application runs 24x7, the total weekly uptime is 24 hours * 7 days = 168 hours, which expressed in minutes is 168 * 60 = 10080 minutes. Achieving 100% availability means the application was available for the entire uptime period.


From stage left now comes Mr. Outage to rob your app of precious uptime minutes. He’ll pull out his sharp knife and carve out a slice of uptime. Keep in mind that any issue with business-level impact counts as lost uptime minutes. Let’s say the outage lasts 1 hour, or 60 minutes (which, of course, would never happen to your team, since you’re applying the information in this blog). Your application was then available for only 10080 - 60 = 10020 minutes. That works out to 10020/10080 = 99.40%, which is your availability.
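The arithmetic above can be sketched in a few lines of Python. The function names `scheduled_minutes` and `availability` are made up for this illustration; they are not part of any tool mentioned in this post.

```python
MINUTES_PER_HOUR = 60


def scheduled_minutes(hours_per_day: float, days_per_week: int) -> float:
    """Total uptime, in minutes, for one week of scheduled service."""
    return hours_per_day * days_per_week * MINUTES_PER_HOUR


def availability(scheduled: float, outage: float) -> float:
    """Percentage of the scheduled uptime that was not lost to outages."""
    return (scheduled - outage) / scheduled * 100


total = scheduled_minutes(24, 7)           # 24x7 application
print(total)                               # 10080
print(f"{availability(total, 60):.2f}%")   # 99.40%
```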

The example above assumes that there was a full outage, where no transactions could be made in the application. Availability, however, need not be an all-or-nothing proposition. Suppose your application services several geographic areas, with varying numbers of business users. Let’s say the breakdown is as follows: 50% AMRS, 40% EMEA and 10% APAC. What happens to our availability number if the outage carving knife is pulled out only during a time which impacts the APAC region? Now our availability impact is really only 10% of the total, and the outage minutes can be adjusted. The adjusted minutes, if we follow the scenario above, would now be 60 * 0.10 = 6. The availability number now looks much better: 10080 - 6 = 10074, which gives 10074/10080 = 99.94%. Voila! We’ve achieved 3-9’s!
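Here is a rough sketch of the regional adjustment, using the user breakdown from the example. The `REGION_SHARE` table and the helper names are assumptions made for illustration only.

```python
# Share of the business user base per region, taken from the example above.
REGION_SHARE = {"AMRS": 0.50, "EMEA": 0.40, "APAC": 0.10}


def adjusted_outage_minutes(outage_minutes: float, regions: list[str]) -> float:
    """Scale raw outage minutes by the share of users in the impacted regions."""
    impacted_share = sum(REGION_SHARE[region] for region in regions)
    return outage_minutes * impacted_share


def availability(scheduled: float, outage: float) -> float:
    """Percentage of the scheduled uptime that was not lost to outages."""
    return (scheduled - outage) / scheduled * 100


adjusted = adjusted_outage_minutes(60, ["APAC"])
print(f"{adjusted:.1f} adjusted minutes")       # 6.0 adjusted minutes
print(f"{availability(10080, adjusted):.2f}%")  # 99.94%
```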

A Prod Support team needs to identify the various services that an application provides, and to determine how much of the uptime is affected when a particular service is out. The percentage of the user base affected also needs to be taken into account. Let me continue with the example above.

Suppose, now, that the outage was caused by a data feed not arriving on time. Let's say that because of this missing feed, the users can perform 80% of their activities, but not the remaining 20%. Again, the outage is happening in APAC, so our adjusted impact in minutes is 6. Of that 6, the users could really do about 80% of the work, so we adjust that value and only account for the 20% worth of impact (6 * 0.20 = 1.2 minutes). Now our availability number looks like: 10080 - 1.2 = 10078.8, and 10078.8/10080 = 99.988%, which rounds to 99.99%. We’re a 4-9’s shop!
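The two adjustments stack by simple multiplication. A minimal sketch, assuming the user share and the share of lost functionality are independent factors (the parameter names `user_share` and `function_share` are hypothetical):

```python
def adjusted_outage_minutes(outage_minutes: float,
                            user_share: float,
                            function_share: float) -> float:
    """Scale raw outage minutes by impacted users and by lost functionality."""
    return outage_minutes * user_share * function_share


def availability(scheduled: float, outage: float) -> float:
    """Percentage of the scheduled uptime that was not lost to outages."""
    return (scheduled - outage) / scheduled * 100


# 60-minute outage, APAC only (10% of users), 20% of activities unavailable.
adjusted = adjusted_outage_minutes(60, user_share=0.10, function_share=0.20)
print(f"{adjusted:.1f} adjusted minutes")       # 1.2 adjusted minutes
print(f"{availability(10080, adjusted):.3f}%")  # 99.988%
```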

What if you have multiple applications? How do we account for them as a whole? Most folks assume that an average would be an appropriate way to account for total availability. But if you remember from your 5th grade arithmetic class, a simple average is not, in general, the way to combine percentages. So, how do we do it?

Suppose you have 6 applications with availabilities as follows: 57%, 63%, 56%, 49%, 65% and 78%.

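The total availability “points” your applications could achieve is 100 each, or 600 (which would mean 100% availability). If we add the points up, they come to 368, which out of 600 = 61.33%. This is your total availability.

The simple average of those same numbers is also 61.33%, because every application here counts for the same 100 points. The two figures diverge only when the applications do not carry equal weight, for example when their uptime periods differ. In that case the average still tells you how available your systems are on average, but not how they provide availability as a whole; for that, total the points (or minutes) actually achieved and divide by the total possible.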
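The points calculation can be sketched as below. The optional `weights` argument is an assumption added for illustration, to show what happens when applications do not count equally (say, because one has a longer uptime window); the weight values in the second call are made up.

```python
def combined_availability(percentages, weights=None):
    """Points achieved divided by points possible, as a percentage.

    Each application can earn up to 100 points per unit of weight.
    With no weights given, every application counts equally.
    """
    if weights is None:
        weights = [1] * len(percentages)
    earned = sum(p * w for p, w in zip(percentages, weights))
    possible = sum(100 * w for w in weights)
    return earned / possible * 100


apps = [57, 63, 56, 49, 65, 78]

# Equal weights: 368 points out of 600.
print(f"{combined_availability(apps):.2f}%")                      # 61.33%

# Hypothetical weights: the last application counts double.
print(f"{combined_availability(apps, [1, 1, 1, 1, 1, 2]):.2f}%")  # 63.71%
```

With equal weights the result matches the simple average of the six percentages; it moves away from the average only once the weights differ.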
