
Wednesday, October 30, 2013

How To Transition from Support to Development

One of the trends I see in forum discussions is Support staff asking questions about becoming a Developer. It seems that many Support staff, at some point, look to make the transition to Development.

I've talked about this before in other posts. And although my advice is to look for Development jobs if you want to be a Developer, there is a way for Support folks to eventually make a transition to Dev.

As with any job, this transition will require that you build competency in your technical abilities and that you build a network of contacts that will help you make a successful jump.

The one piece of advice I can give Support staff about building a good network with Developers is to build solid relationships with your own Dev team. I'm sure this is not a ground-breaking revelation. A good way to do this is to learn the application well, which involves formal and informal knowledge transfer from guess who? Be responsive to issues and lend a hand when the Dev team needs assistance getting their work done. If invited, make an effort to attend social functions that involve Dev staff. Knowing folks outside of work settings is always beneficial when building networks. Remember that these same Development folks might be able to point you to internal or external openings.

In terms of building your technical abilities as a Developer, there are several ways to go. Since your day-to-day doesn't require a heavy Dev focus, you'll need to find ways to keep your skills sharp (and to build new ones) outside of work. Here are a few suggestions.

Take on freelance work. This is better than having a pet project at home, in that you'll have requirements and deadlines to meet. You won't just toss these projects aside when your buddies invite you out for a beer. Take on projects that stretch your comfort zone yet still give you a fighting chance to deliver. Solid results and good feedback are always welcome on your resume and on networking sites.

Another possibility is to take a college course or two. Many companies these days will pay your tuition, so take advantage of those programs, which can provide a solid foundation for building new skills. Again, the good thing about college classes is that you have deadlines to meet. Given the price of the courses, you also won't just toss them aside.

Support efforts also benefit from efficiency. Building tools to make the support effort more efficient is always a good challenge to take on. And you can do this at work, if your extracurricular life doesn't allow for side projects.
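As a hypothetical illustration of such a tool (the log format and "ERROR" convention here are assumptions, not from any particular system), even a short script that summarizes recurring errors in a log file can save a Support team real time during triage:

```python
import re
from collections import Counter

def summarize_errors(log_path):
    """Count occurrences of each distinct ERROR message in a log file,
    returning them most-frequent first."""
    pattern = re.compile(r"ERROR\s+(.*)")
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            match = pattern.search(line)
            if match:
                counts[match.group(1).strip()] += 1
    return counts.most_common()
```

A sketch like this is easy to extend over time (date filters, e-mail summaries), and building it teaches you the same skills a Dev role requires.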

And yes, it's OK to take on pet projects of your own. Try to incorporate new things into them. For example, if you don't know how to use design patterns, learn about them and implement your code using those designs. Incorporate best practices like code refactoring. These exercises will make your brain more efficient at applying the techniques when the time comes. Keeping your skills sharp will help if you want to transition to Development, and it will make you a better Support resource until then.
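To make the design-pattern suggestion concrete, here is a minimal sketch of the Strategy pattern applied to a made-up Support scenario (the policy names and severity thresholds are invented for illustration): the routing decision is pulled out of the caller and into interchangeable policy objects.

```python
from abc import ABC, abstractmethod

class EscalationPolicy(ABC):
    """Strategy interface: each policy decides who to notify for an incident."""
    @abstractmethod
    def recipients(self, severity):
        ...

class BusinessHoursPolicy(EscalationPolicy):
    def recipients(self, severity):
        # During the day, loop in the dev lead only for severe incidents.
        return ["support-team"] if severity < 3 else ["support-team", "dev-lead"]

class OvernightPolicy(EscalationPolicy):
    def recipients(self, severity):
        # Overnight, only high-severity incidents page anyone.
        return ["on-call"] if severity >= 3 else []

class IncidentNotifier:
    """Context object: delegates the routing decision to the injected strategy."""
    def __init__(self, policy):
        self.policy = policy

    def notify(self, severity):
        return self.policy.recipients(severity)
```

Swapping `OvernightPolicy` for `BusinessHoursPolicy` changes behavior without touching `IncidentNotifier`, which is the whole point of the pattern.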

Friday, October 25, 2013

Keep It Together: Stability Reviews

Periodic reviews of what's been happening with your system(s) are a critical part of ensuring maximum availability. Every system will eventually become obsolete: volumes will increase, better hardware and technology will become available, software will stop being supported, etc. Thus, you want to be ahead of the curve when it comes to reviewing your system and determining whether it's time to tune or change something about it.

A thorough stability review for your system should happen at least twice a year, though I suggest you do this on a quarterly basis. Most companies don't perform stability reviews until something is wrong with the system (when they've had multiple recurring issues). Needless to say, that's not the best approach, though you should never let a good crisis go to waste.

So what should the stability review entail? Every architectural component should be reviewed for improvements. For example: the application, the middleware, the databases, the network, the application hardware, upstream and downstream dependencies, etc.

The way I like to prepare for a stability review is to look at an architecture diagram. I list every single component that shows up on the diagram. Then I organize recurring meetings with the owners of each component to discuss what can be done to improve resiliency. This becomes your working group. I generally find that getting buy-in for this type of assessment is easy. It makes sense, since most people would rather resolve issues proactively than work on them during a middle-of-the-night firecall.

When you meet, ask each stakeholder what can be changed for improvement. For example, ask the network team whether all interfaces are redundant and whether they will seamlessly failover when something goes wrong. Talk to your DBAs to ensure that your DB is optimally tuned for the amount of data you have (Do you need to purge? Do your execution plans need to change? Are the right DB parameters set to ensure maximum throughput?). Discuss with the Development team whether things can be improved (Is seamless, automatic failover between redundant components a possibility? Can a graceful way of shutting down the application, to ensure maximum transaction safety, be coded? Can dependencies on lengthy batch feeds be removed or reduced?). Review capacity planning reports to ensure each section of the system will be able to handle the application volume.

As you review each component, action items will come up. It's important to set expectations about the turnaround for actions when you kick off your stability assessment. Try to get everything in place in 1-2 months. Don't let activities drag on; otherwise, the risk of something going wrong grows, and some items may never get completed.

Follow up on the action items on a weekly basis with your working group. As the actions start being implemented, your application resiliency will get better. And your confidence around your application stability will significantly grow.

Tuesday, October 22, 2013

Saying No (Even To The Business)

Many of the posts on this blog have focused on enabling the business and keeping the business in mind as we carry out Support tasks. So, you might find it strange that I'm dedicating a post to saying "no" to the same business users we purportedly help. Let me provide an example where I've directed my teams to say no to the business. Note how we did it, and decide for yourself whether you agree or disagree.

One of the managers who reports to me brought up a concern. His team, for historical reasons, had been helping the business with even the most menial of requests. He was trying to determine how to stop these types of requests, as they were robbing his team of valuable bandwidth that could be better spent on value-added tasks. For example, they would get calls from business users asking why their printer wasn't working or asking them to reset their LAN ID. There is a help desk that manages these requests, yet users were reaching out to Support to work on these tasks.

I asked him to set up a meeting with the business department head so we could talk about these requests. Ahead of the meeting, we prepared a couple of slides showing where our Support effort was going. We determined that about 10% of our total Support bandwidth was dedicated to servicing menial requests that the business users should have been able to handle themselves. We also identified various improvements we could have accomplished were we not spending time, say, fixing printers.

The meeting came and we presented our case to the business partner. He was in total agreement that we should be spending our time doing things like automating manual tasks or adding better monitoring, rather than resetting passwords or clearing out blockages in a printer. We asked if he could send us an e-mail with his expectations around which items we would no longer be servicing.

When the next request came around, we responded to the business user that they should contact the help desk and we provided instructions on how to do so. We also attached the e-mail from the business head explaining that we would no longer be working on such requests. Some of the users were less than thrilled, of course. However, they eventually understood the reasons behind our inability to service those requests. Eventually, the requests stopped.

In servicing the business, we have to realize that our bandwidth comes at a premium. And it's actually in the business' best interest to have us focus on value-added tasks. It's important to maintain our posture as a group and not become a dumping ground for issues that people would rather just not deal with. In this scenario, we made our case and showed the business that they were better off with us not spending time on these requests. As paradoxical as it sounds, saying no actually helped us support them better.

Wednesday, October 16, 2013

To Patch Or Not To Patch

You've been through this before: those weeks where, day after day, the same issue strikes, and despite your best efforts, you can't determine the root cause. You've increased logging and are going through the files with a fine-toothed comb, all to no avail.

Then, in a glorious burst of inspiration... finally... you find it. It's staring right at you. It's a bug! You can fix that. In fact, you have the fix, but guess what? It's Monday. Not only that, but in order to deploy the fix, you'll have to bring down (impact) your 24x5.5 system for a while. When the bug presents itself, it's a lot of work (usually at around 2 AM) to correct the situation it creates. It involves updating data manually, which is potentially dangerous. So, should you impact your system's availability and patch, meaning you get to sleep and avert the risk of errors? Or do you wait until the weekend and attempt to hold the fort?

The answer comes down to risk management, which is one of the primary goals of a support team (read The Purpose of Production Support). Patching (change) involves inherent risk. Making a change to your environment could have impacts beyond what you're trying to correct. For example, what if the bug you found requires correcting a common library, meaning you have to recompile a good number of binaries?

Another question you should be asking is how thorough the testing was. Many times, it's impossible to perform a full set of regression tests before the change has to go in.

In the scenario above, we're also subject to risks extraneous to technology. For example, what if your system is a financial trading system and an outage means your business users are unable to take advantage of a favorable move in the market?

The scenario above has a fairly well-known workaround. So another question that comes up is: can the risks inherent in this workaround be mitigated? For example, is it possible to automate a set of SQL queries that would reduce the potential for manual errors?
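As a rough sketch of what automating that workaround might look like (the `orders` table, its column names, and the status values are entirely hypothetical), wrapping the manual fix in a parameterized script with a dry-run mode removes most of the typo risk of hand-typed 2 AM SQL:

```python
import sqlite3

def apply_workaround(conn, order_ids, dry_run=True):
    """Hypothetical workaround: re-mark stuck orders so the next batch
    run picks them up. Parameterized queries avoid error-prone ad-hoc SQL."""
    placeholders = ",".join("?" for _ in order_ids)
    rows = conn.execute(
        f"SELECT id FROM orders WHERE status = 'STUCK' AND id IN ({placeholders})",
        order_ids,
    ).fetchall()
    affected = [row[0] for row in rows]
    # In dry-run mode, report what would change without touching the data.
    if not dry_run and affected:
        update_placeholders = ",".join("?" for _ in affected)
        conn.execute(
            f"UPDATE orders SET status = 'PENDING' WHERE id IN ({update_placeholders})",
            affected,
        )
        conn.commit()
    return affected
```

Running with `dry_run=True` first lets the on-call person confirm the exact rows before committing the change, which is precisely the kind of safety net a manual procedure lacks.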

The ability to identify when the bug strikes is also an important risk factor. If identifying the error is straightforward and we have an automated workaround, the risk becomes much lower.

So the answer to "To Patch or Not to Patch" requires inquiry into many factors. Each situation will be different, with its own urgency as well as its own business and technical nuances. But asking these questions and working through them to make the right decision is a sign of a mature Support team.

Think about these various factors the next time you're faced with a dilemma like this. You might very well conclude that the best approach is to push the patch out a few more days. On the other hand, you might decide that the workaround is too risky to continue with, meaning you have no choice but to install the patch.