
Tuesday, September 24, 2013

Are You Sure About Your Monitoring?

Today we had an embarrassing incident. It started at 1:00 AM, and we didn't catch it until 10:00 AM, when business users reported that they were missing some data. So, basically, we went nine hours without knowing something was wrong. As it turns out, we had a monitoring gap.

A log monitor that watches a file for certain strings failed to catch one of the strings it was configured for. Here's the timeline that explains why:
  1. The monitor is set up to tail the log file every 5 minutes and capture everything written to the log since its last run. This is by design in the vended application we use for monitoring.
  2. The monitor runs at 12:59 AM and finds no errors.
  3. The error is logged at 1:00:59 AM with the string "LOG EXCEPTION".
  4. The log file rolls because it has a size limit.
  5. The monitor runs at 1:04 AM and tails the file again, but the error is now in the rolled file, which the monitor does not read.
  6. At 10:00 AM, a business user reports the problem.

Gotcha! Clearly we missed this case when we set up the monitor. We've now configured our monitoring to always look at the last two log files. Since the files don't grow too quickly, that should suffice given the 5-minute interval.
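
For anyone who wants to picture the "last two log files" idea outside of a vended tool, here is a minimal Python sketch. It is not how our monitoring product does it. The function name `scan_recent_logs`, the base file name `app.log`, and the assumption that rolled files share that prefix (for example `app.log.1`) are all made up for illustration.

```python
from pathlib import Path


def scan_recent_logs(log_dir, base_name="app.log", pattern="LOG EXCEPTION", keep=2):
    """Search the `keep` most recently modified log files for `pattern`.

    Hypothetical helper: assumes rolled files live in the same directory
    and start with `base_name` (e.g. app.log, app.log.1, app.log.2).
    Returns a list of (file name, matching line) tuples.
    """
    # Newest files first, so the live log and the latest rolled log are kept.
    candidates = sorted(
        Path(log_dir).glob(base_name + "*"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )[:keep]

    hits = []
    for path in candidates:
        # errors="replace" keeps a stray bad byte from aborting the scan.
        with path.open(errors="replace") as fh:
            for line in fh:
                if pattern in line:
                    hits.append((path.name, line.rstrip("\n")))
    return hits
```

Calling `scan_recent_logs("/path/to/logs")` every 5 minutes would have found the error in the rolled file at the 1:04 AM run. Note that this sketch rescans whole files on every run, so a real monitor would also need to remember what it has already reported to avoid duplicate alerts.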

So if you use SiteScope, keep this in mind. Don't get caught with your pants down. I'm sure by now I've lost everyone who uses SiteScope for monitoring (they're now checking that they don't have similar gaps). :-)

Cheers!
