Production Support Blog
Are you passionate about making your Production Support team better? Join me as we explore topics in Production Support of Mission Critical applications.
Monday, March 9, 2015
Monitor Your Interfaces Carefully
We had one of those issues that just sneaks up on you. Just when you think you're covered from a monitoring perspective, an eye-opening issue reminds you that good monitoring is an ever-evolving process. So, what happened?...
System A sends a trade with an identifier. System B receives that trade and does several things with it, including creating a new account if one doesn't already exist. System A is an in-house application; System B is a vended application. System B limits the identifier to 7 characters. System A had been sending identifiers of 7 characters or fewer until, of course, it reached 10,000,000... Boom!
Why didn't the monitoring catch it? Monitoring is typically built on a per-application basis. Support teams typically monitor things like space, memory, CPU, etc. Folks might even set up alerts to monitor business transactions and how well they're progressing. What I see very little of is monitoring that ensures interfaces are being respected and that situations like the one above won't occur.
In this case, we had been monitoring to make sure System B wasn't overflowing the number field in the DB. However, we hadn't been monitoring from the application standpoint to make sure that some other application wasn't overflowing the application limits.
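A lightweight way to close this gap is a scheduled check on the sending side that compares outgoing field values against the receiving system's documented limits. Below is a minimal sketch, assuming a hypothetical outbound feed of trade records and the 7-character identifier limit from the example above; the field names and thresholds are placeholders, not anything from a real system.

```python
# Sketch of an interface-limit check (hypothetical field names and limits).
# Run it on a schedule against the sending system's outbound feed and alert
# before the receiving (vended) system's limit is actually breached.

# Documented limits of the receiving system; adjust to your interface spec.
DOWNSTREAM_LIMITS = {"trade_identifier": 7}

# Alert when a field is within this many characters of its limit.
WARN_MARGIN = 1

def check_interface_limits(records):
    """Return warnings for fields approaching or exceeding downstream limits."""
    warnings = []
    for record in records:
        for field, limit in DOWNSTREAM_LIMITS.items():
            value = str(record.get(field, ""))
            if len(value) > limit:
                warnings.append(f"BREACH: {field}={value!r} exceeds {limit} chars")
            elif len(value) >= limit - WARN_MARGIN:
                warnings.append(f"WARN: {field}={value!r} is near the {limit}-char limit")
    return warnings

if __name__ == "__main__":
    # 9,999,999 is at the limit; 10,000,000 has already overflowed it.
    sample = [{"trade_identifier": "9999999"}, {"trade_identifier": "10000000"}]
    for line in check_interface_limits(sample):
        print(line)
```

Run on a schedule, a check like this would have flagged the identifier at 9,999,999, well before the downstream overflow.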
Monitoring is something that should be built in a phased approach: hardware, processes, business transactions, middleware, etc. Don't forget "interfaces." If you do, things will likely go BOOM sooner rather than later!
Monday, November 11, 2013
A Recap of all Critical Support Processes (Part 1)
This blog centers on Support processes that are critical. There may be more, but I've compiled a list of those I consider key. This first part focuses on the top 10; doing these will make your team good. The second part will cover all the others; do those in addition, and your team will be great!
So without further delay:
| Process Name | Description | Why you need it |
|---|---|---|
| Major Incident Management | Managing issues with the appropriate level of communication. | This is how you fight fires. |
| Monitoring & Alerting | Identifying the health of your applications at the hardware, middleware, and business level. | This is how you identify issues with your applications. |
| Capacity Planning | Identifies whether applications can handle increasing volumes. | You're able to predict how volumes will affect your applications. |
| Problem Management | After root cause analysis, this process tracks follow-up actions. | This is how you get proactive and prevent issues from happening again. |
| Change Management | Tracks all changes going into the environment. | Changes bring instability. |
| CRM | This is how you track and manage your customer relationships. | You know who to contact/tell about issues. |
| Stability Projects | These are short-term projects that help resolve recurring issues. | These supplement your Problem Management program. |
| Handovers | Teams finishing a shift transfer accountability to those on the next shift. | Ensures that work is done following the sun. |
| Production Transitions | Process to keep the Support team up to speed on new releases. | Ensures your team's application knowledge doesn't become stale. |
| Disaster Recovery Planning | Process that outlines how to recover applications during a disaster. | Ensures you can continue to do business during a major disaster event. |
Wednesday, October 30, 2013
How To Transition from Support to Development
One of the trends I see in forum discussions is Support staff asking questions about becoming a Developer. It seems that many Support staff, at some point, are looking to make a transition to Development.
I've talked about this before in other posts. And although my advice is to look for Development jobs if you want to be a Developer, there is a way for Support folks to eventually make a transition to Dev.
As with any job, this transition will require that you build competency in your technical abilities and that you build a network of contacts that will help you make a successful jump.
The one thing I can say to Support staff about building a good network with Developers is to build solid relationships with your own Dev team. I'm sure this is not a ground-breaking revelation. A good way to do this is to learn the application well, which involves formal and informal knowledge transfer from, guess who, the Dev team. Be responsive to issues and lend a helping hand when the Dev team needs help getting their work accomplished. If invited, make an effort to attend social functions that involve Dev staff; knowing folks outside of work settings is always beneficial when building networks. Remember that these same Developers might be able to point you to internal or external openings.
In terms of building your technical abilities as a Developer, there are several ways to do so. Since your day-to-day work doesn't require a heavy Dev focus, you'll need to find ways to keep your skills sharp (and to build new skills) outside of work. Here are a few suggestions.
Take on freelance work. This is better than having a pet project at home, in that you'll have requirements and deadlines to meet. You won't just toss these projects aside when your buddies invite you out for a beer. Take on projects that challenge your comfort zone yet still give you a fighting chance to deliver. Solid results and good feedback are always welcome on your resume and on networking sites.
Another possibility is to take a college course or two. Many companies these days will pay your tuition, so take advantage of those programs; they can provide a solid foundation when building new skills. Again, the good thing about taking college classes is that you have deadlines to meet. Given the price of the courses, you also won't just toss them aside.
Support efforts also benefit from efficiency. Building tools to make the support effort more efficient is always a good challenge to take on. And you can do this while at work, if your extracurricular life doesn't allow for taking on side projects.
And yes, it's OK to take on pet projects of your own. Try to incorporate new things in them. For example, if you don't know how to use patterns, try to learn about them and implement your code using those designs. Incorporate best practices like code refactoring. These types of exercises will make your brain more efficient at applying these techniques when the time comes. Keeping your skills sharp will help if you want to transition to Development and it will also make you a better Support resource until then.
Friday, October 25, 2013
Keep It Together: Stability Reviews
Periodic reviews of what's been happening with your system(s) are a critical part of ensuring maximum availability. Every system will eventually become obsolete: volumes will increase, better hardware and technology will become available, software stops being supported, etc. Thus, you want to be ahead of the curve when it comes to reviewing your system and determining whether it's time to tune or change something about it.
A thorough stability review for your system should happen at least twice a year, though I suggest you do this on a quarterly basis. Most companies don't perform stability reviews until something is wrong with the system (when they've had multiple recurring issues). Needless to say, that's not the best approach, though you should never let a good crisis go to waste.
So what should the stability review entail? Every architectural component should be reviewed for improvements. For example: the application, the middleware, the databases, the network, the application hardware, upstream and downstream dependencies, etc.
The way I like to prepare for a stability review is to look at an architecture diagram. I list every single component that shows up on the diagram. Then I organize recurring meetings with the owners of each component to discuss what can be done to improve resiliency. This will become your working group. I generally find that getting buy-in to perform this type of assessment is easy. It makes sense, since most people would rather proactively resolve issues than work on them in a middle-of-the-night firecall.
When you meet, ask each stakeholder what can be changed for improvement. For example, ask the network team whether all interfaces are redundant and whether they will seamlessly failover when something goes wrong. Talk to your DBAs to ensure that your DB is optimally tuned for the amount of data you have (Do you need to purge? Do your execution plans need to change? Are the right DB parameters set to ensure maximum throughput?). Discuss with the Development team whether things can be improved (Is seamless, automatic failover between redundant components a possibility? Can a graceful way of shutting down the application, to ensure maximum transaction safety, be coded? Can dependencies on lengthy batch feeds be removed or reduced?). Review capacity planning reports to ensure each section of the system will be able to handle the application volume.
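It helps to walk into these conversations with simple numbers rather than impressions. As one illustration, here is a minimal sketch of the headroom arithmetic behind the capacity discussion; all of the volume figures below are hypothetical placeholders, not figures from any real system.

```python
# Sketch of the headroom arithmetic behind a capacity discussion.
# All figures are hypothetical placeholders.

def months_until_limit(current_volume, monthly_growth_rate, limit):
    """Estimate months of headroom assuming compound monthly growth."""
    months = 0
    volume = current_volume
    while volume < limit:
        volume *= (1 + monthly_growth_rate)
        months += 1
        if months > 600:  # guard against effectively flat growth
            return None
    return months

if __name__ == "__main__":
    # e.g. 6M rows today, growing 4% per month, purge/partition planned at 10M
    headroom = months_until_limit(6_000_000, 0.04, 10_000_000)
    print(f"Roughly {headroom} months before the current limit is reached")
```

A component that is roughly a year away from its limit is a very different conversation from one that is two months away.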
As you review each component, action items will come up. It's important to set expectations when you kick off your stability assessment regarding the turnaround for actions. Try to get everything in place in 1-2 months. Don't let activities drag on; otherwise, the risk of something going wrong becomes higher, and there's also a risk that things will never get completed.
Follow up on the action items on a weekly basis with your working group. As the actions start being implemented, your application resiliency will get better. And your confidence around your application stability will significantly grow.
Tuesday, October 22, 2013
Saying No (Even To The Business)
Many of the posts on this blog have focused on enabling the business and keeping the business in mind as we carry out Support tasks. So, you might find it strange that I'm dedicating a post to saying "no" to the same business users we purportedly help. Let me provide an example where I've directed my teams to say no to the business. Note how we did it, and consider whether you agree or disagree.
One of the managers who reports to me brought up a concern. His team, for historical reasons, had been helping the business with even the most menial of requests. He was trying to determine how to stop these types of requests, as they were robbing his team of valuable bandwidth that would be best utilized for more value-added tasks. For example, they would get calls from business users asking why their printer wasn't working or asking them to reset their LAN ID. There is a helpdesk that manages these requests, yet users were reaching out to Support to handle them.
I asked him to put a meeting together with the business department head so we could talk about these requests. Ahead of the meeting, we prepared a couple of slides identifying Support Effort and where it was going. We determined that about 10% of our total Support bandwidth was dedicated to servicing menial requests that the business users should have been able to handle. We also identified various areas for improvement that we would otherwise be able to accomplish, were we not spending time, say, fixing printers.
The meeting came and we presented our case to the business partner. He was in total agreement that we should be spending our time doing things like automating manual tasks or adding better monitoring, rather than resetting passwords or clearing out blockages in a printer. We asked if he could send us an e-mail with his expectations around which items we would no longer be servicing.
When the next request came around, we responded to the business user that they should contact the help desk and we provided instructions on how to do so. We also attached the e-mail from the business head explaining that we would no longer be working on such requests. Some of the users were less than thrilled, of course. However, they eventually understood the reasons behind our inability to service those requests. Eventually, the requests stopped.
In servicing the business, we have to realize that our bandwidth comes at a premium. And it's actually in the business' best interest to have us focus on value-added tasks. It's important to maintain our posture as a group and not become a dumping ground for issues that people would just rather not deal with. In this scenario, we made our case and actually showed the business that they were better off with us not spending time on these requests. As paradoxical as this sounds, it actually helped us support them better.
Wednesday, October 16, 2013
To Patch Or Not To Patch
You've been through this before: those weeks where, day after day, the same issue strikes, and despite your best efforts at determining a root cause, you can't find it. You've increased logging and are going through the files with a fine-tooth comb; all to no avail.
Then, in a glorious burst of inspiration...Finally...You find it. It's staring right at you. It's a bug! You can fix that. In fact you have the fix, but guess what? It's Monday. Not only that, in order to deploy the fix, you'll have to bring down (impact) your 24x5.5 system for a while. When the bug presents itself, it's a lot of work (usually at around 2 AM) to correct the situation it creates. It involves updating data manually, which could be potentially dangerous. So, should you impact your system availability and patch, meaning you get to sleep and avert the risk of errors? Or do you wait until the weekend and attempt to hold the fort?
The answer to this is all about risk management, which is one of the primary goals of a support team (read The Purpose of Production Support). Patching (change) involves inherent risk. Making a change to your environment could potentially have impacts beyond what you're trying to correct. For example, what if the bug you found requires correcting a common library, meaning you have to recompile a good number of binaries?
One of the questions you should be asking yourself, too, is, how thorough was the testing? Many times, it's impossible to perform a full set of regression tests before the change has to go in.
In the scenario above, we're also subject to risks extraneous to technology. For example, what if your system is a financial trading system and an outage means your business users are unable to take advantage of a favorable move in the market?
The scenario above provides a fairly well-known workaround. Another question that comes up, then, is: can the risks inherent to this workaround be mitigated? For example, is it possible to automate a set of SQL queries that will reduce the potential for manual errors?
The ability to identify when the bug strikes is also an important risk factor. If identifying the error is straightforward and we have an automated workaround, then the risk becomes much lower.
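To make the "automate the workaround" question concrete, here is a minimal sketch of what such automation might look like. The table, column, and status names are invented for illustration, and the in-memory SQLite database simply stands in for a real one; the point is that a parameterized, transactional update removes the 2 AM typo risk that manual data fixes carry.

```python
import sqlite3

# Sketch of an automated workaround: detect rows hit by a known bug and
# apply the manual correction as a single parameterized, transactional update.
# Table, column, and status names are hypothetical placeholders.

def apply_workaround(conn):
    """Find trades stuck by the known bug and reset them for reprocessing."""
    with conn:  # commits on success, rolls back on any exception
        stuck = conn.execute(
            "SELECT trade_id FROM trades WHERE status = ?", ("STUCK",)
        ).fetchall()
        if not stuck:
            return 0
        conn.execute(
            "UPDATE trades SET status = ? WHERE status = ?", ("RETRY", "STUCK")
        )
        return len(stuck)

if __name__ == "__main__":
    # In-memory SQLite stands in for the real database in this sketch.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trades (trade_id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany("INSERT INTO trades (status) VALUES (?)",
                     [("STUCK",), ("DONE",), ("STUCK",)])
    print(f"Corrected {apply_workaround(conn)} stuck trades")
```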
So the answer to "To Patch or Not to Patch" requires some inquiry into many factors. Each situation will be different, with its own degree of urgency as well as its own business and technical nuances. But asking these questions and struggling through them to make the right decision is a sign of a mature Support team.
Think about these various factors the next time you're faced with a dilemma like this. You might very well conclude that the best approach is to push the patch out a few more days. On the other hand, you might decide that the workaround is too risky to continue with, meaning you have no choice but to install the patch.
Friday, September 27, 2013
How To Hire Support Staff
Hiring the right personnel is one of the most critical things you can do as a Support manager. In an era of small budgets, you need all the people you can get and you need them firing on all cylinders. Otherwise, you'll spend the next six months trying to teach your new hire about your application, your processes, your business, etc. just to find out they "don't get it."
So how do you ensure you hire people that "get it?" There are several things you can do to ensure hiring success:
- A good phone screen. I like to split phone screens into several parts. First, I explain the role and what I'm looking for in a candidate. Second, I have the candidate take me through their resume. I leave this open-ended so I can also assess their communication skills, and I look for them to explain how their past experience makes them a good fit. Third, I ask a set of short-answer technical questions, e.g. "What command would you use to find a string in a log file?" Any question regarding keywords they put in their resume is fair game. For example, if you say you know C, you'd better know what the static keyword in C means (hint: it's not the same as in C++). Finally, I give them the opportunity to ask me questions.
- Fair questions in face-to-face interviews. I generally ask three technical questions when I interview face-to-face. These are all written/on-the-board exercises. I ask a SQL question, usually how to perform an implicit join. I ask a scripting question related to the relevant OS for the role. I also ask them to write a short program in any compiled or scripting language they're comfortable with, for example, "How would you implement a function to write a string backwards?" (sample answers are sketched after this list). The idea is not to "stump" people or to show off your own knowledge, but to assess what they know.
- Answer your own question: "Am I able to work with this person?" This is perhaps the most important piece of the puzzle. Even if a person has all the skills needed for the role, they'll also need to have the right personality to get hired. You know your group's personality and culture. Never hire someone who won't fit in. You'll only end up losing them as soon as you're done training them.
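For illustration, here are possible answers to the sample technical questions above. This is just one reasonable version of each, not the only acceptable answer; the grep command and implicit-join SQL appear as comments so that everything fits in a single short Python sketch.

```python
# Sample answers to the screening questions above; one reasonable version
# of each, not the only acceptable one.

# "Find a string in a log file" (shell):
#   grep "ERROR" application.log
#
# "Implicit join" (SQL) -- the join condition lives in the WHERE clause:
#   SELECT o.order_id, c.name
#   FROM orders o, customers c
#   WHERE o.customer_id = c.customer_id;

def reverse_string(s: str) -> str:
    """Write a string backwards without leaning on slicing shortcuts."""
    chars = []
    for ch in s:
        chars.insert(0, ch)  # prepend each character as we walk the string
    return "".join(chars)

if __name__ == "__main__":
    print(reverse_string("production"))  # -> noitcudorp
```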