Periodic reviews of what's been happening with your system(s) are a critical part of ensuring maximum availability. Every system will eventually become obsolete: volumes will increase, better hardware and technology will become available, software stops being supported, etc. Thus, you want to be ahead of the curve when it comes to reviewing your system and determining whether it's time to tune or change something about it.
A thorough stability review for your system should happen at least twice a year, though I suggest you do this on a quarterly basis. Most companies don't perform stability reviews until something is wrong with the system (when they've had multiple recurring issues). Needless to say, that's not the best approach, though you should never let a good crisis go to waste.
So what should the stability review entail? Every architectural component should be reviewed for improvements. For example: the application, the middleware, the databases, the network, the application hardware, upstream and downstream dependencies, etc.
The way I like to prepare for a stability review is to look at an architecture diagram. I list every single component that shows up on the diagram. Then I organize recurring meetings with the owners of each component to discuss what can be done to improve resiliency. This will become your working group. I generally find that getting buy-in to perform this type of assessment is easy. It makes sense, since most people would rather proactively resolve issues, than work on them in a middle-of-the-night firecall.
When you meet, ask each stakeholder what can be changed for improvement. For example, ask the network team whether all interfaces are redundant and whether they will seamlessly failover when something goes wrong. Talk to your DBAs to ensure that your DB is optimally tuned for the amount of data you have (Do you need to purge? Do your execution plans need to change? Are the right DB parameters set to ensure maximum throughput?). Discuss with the Development team whether things can be improved (Is seamless, automatic failover between redundant components a possibility? Can a graceful way of shutting down the application, to ensure maximum transaction safety, be coded? Can dependencies on lengthy batch feeds be removed or reduced?). Review capacity planning reports to ensure each section of the system will be able to handle the application volume.
As you review each component, action items will come up. It's important to set expectations when you kick off your stability assessment regarding the turnaround for actions. Try to get everything in place is 1-2 months. Don't let activities drag on, as otherwise, the risk of something going wrong becomes higher. There's also risk that things will never get completed.
Follow up on the action items on a weekly basis with your working group. As the actions start being implemented, your application resiliency will get better. And your confidence around your application stability will significantly grow.
No comments:
Post a Comment