You've been through this before: those weeks where day after day the same issue strikes and despite your best effort at determining a root cause, you can't come to it. You've increased logging and are going through the files with a fine tooth comb; all to no avail.
Then, in a glorious burst of inspiration...Finally...You find it. It's staring right at you. It's a bug! You can fix that. In fact you have the fix, but guess what? It's Monday. Not only that, in order to deploy the fix, you'll have to bring down (impact) your 24x5.5 system for a while. When the bug presents itself, it's a lot of work (usually at around 2 AM) to correct the situation it creates. It involves updating data manually, which could be potentially dangerous. So, should you impact your system availability and patch, meaning you get to sleep and avert the risk of errors? Or do you wait until the weekend and attempt to hold the fort?
The answer to this is all about risk management, which is one of the primary goals of a support team (read The Purpose of Production Support). Patching (change) involves inherent risk. Making a change to your environment could potentially have impacts beyond what you're trying to correct. For example, what if that bug you found requires correcting a common library (meaning your have to recompile a good number of binaries).
One of the questions you should be asking yourself, too, is, how thorough was the testing? Many times, it's impossible to perform a full set of regression tests before the change has to go in.
In the scenario above, we're also subject to risks extraneous to technology. For example, what if your system is a financial trading system and an outage means your business users are unable to take advantage of a favorable move in the market?
The scenario above provides a fairly well known workaround. Another question that comes up then is, can the risks inherent to this workaround be mitigated. For example, is it possible to automate a set of SQL queries that will reduce the potential for manual errors?
The ability to identify when the bug strikes is also an important risk factor. If identifying the error is straight-forward and we have an automated workaround, then the risk becomes much lower.
So the answer of whether "To Patch or Not to Patch" requires some inquiry into many factors. Each situation will be different, with its own set of urgency as well its business and technical nuances. But asking these questions and struggling through them to make the right decision is a sign of a mature Support team.
Think about these various factors the next time you're faced with a dilemma like this. You might very well conclude that the best approach is to push the patch out a few more days. On the other hand, you might decide that the workaround is too risky to continue with, meaning you have no choice but to install the patch.
No comments:
Post a Comment