One of the most critical responsibilities of an application support team is Application Health Management (see my previous entry, The 6 Managements of Prod Support), also known as Monitoring and Alerting. From experience, I believe monitoring application health has three layers (plus a few additional areas that need to be monitored, lest availability suffer).
These three layers work much like a Jenga stack. As individual components or blocks start coming out of service, system stability degrades until eventually the whole stack comes crashing down. Application Support should know when individual components have been impacted and should take proactive steps to restore them to full service.
The three layers I’m talking about are (with a sketch of example checks after the list):
1) Machine Health: Total CPU, Total Memory, Total Disk, Total Swap, Network interfaces, etc.
2) Basic Application monitoring: Key processes being up, process memory utilization, basic error and exception checking, smoke tests, etc.
3) Business Process monitoring: Transactions occurring correctly, transactions occurring within the performance SLA, transaction acknowledgements, transaction persistence in the DB, etc. For financial applications: market data feeds, user sessions, pricing, etc.
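To make the layers concrete, here is a minimal sketch of what a check at each layer might look like, using Python and the psutil library. The thresholds, the process name and the transaction-count callable are placeholders I’ve invented for illustration, not prescriptions; in practice your monitoring tool would run the equivalent checks for you.

```python
# Minimal sketch of checks at each layer (thresholds and names are illustrative).
# Requires the third-party psutil package: pip install psutil
import psutil

# Layer 1: Machine health -- total CPU, memory, disk, swap.
def check_machine_health():
    alerts = []
    if psutil.cpu_percent(interval=1) > 90:
        alerts.append("CPU above 90%")
    if psutil.virtual_memory().percent > 85:
        alerts.append("Memory above 85%")
    if psutil.disk_usage("/").percent > 80:
        alerts.append("Disk above 80%")
    if psutil.swap_memory().percent > 50:
        alerts.append("Swap above 50%")
    return alerts

# Layer 2: Basic application monitoring -- is the process up, how much memory is it using?
def check_application(process_name="order_engine"):   # hypothetical process name
    alerts = []
    procs = [p for p in psutil.process_iter(["name", "memory_info"])
             if p.info["name"] == process_name]
    if not procs:
        alerts.append(f"{process_name} is not running")
    for p in procs:
        rss_mb = p.info["memory_info"].rss / (1024 * 1024)
        if rss_mb > 2048:
            alerts.append(f"{process_name} (pid {p.pid}) using {rss_mb:.0f} MB")
    return alerts

# Layer 3: Business process monitoring -- are transactions landing in the database
# within the performance SLA? count_recent_transactions is a placeholder for whatever
# query or API call your application actually exposes.
def check_business_process(count_recent_transactions, sla_seconds=5):
    alerts = []
    count, slowest = count_recent_transactions()   # e.g. rows in the last minute, worst latency
    if count == 0:
        alerts.append("No transactions persisted in the last minute")
    if slowest > sla_seconds:
        alerts.append(f"Slowest transaction took {slowest:.1f}s, SLA is {sla_seconds}s")
    return alerts
```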
The other pieces I didn’t include in these three layers are the middleware components. The reason I don’t bundle them with the application is twofold. First, middleware components such as application servers, databases and queuing systems are not really part of the application itself. Second, in most medium and large organizations, monitoring the health of shared infrastructure is handled by a separate team.
How to Check your Jenga Stack
Let me make one more comment, this time on how the monitoring itself should be done.
In terms of tooling, it’s best to use enterprise products such as CA APM or Geneos (ITRS). I’m familiar with both and can recommend them to teams looking for a monitoring solution. Although I don’t have first-hand experience with Nagios, I hear a lot of good feedback on it (and it’s mostly free).
I see many teams use tools like the ones above to capture issues, only to turn around and send e-mail alerts. I strongly advise against this approach. E-mail quickly becomes unmanageable, especially if alerts and business requests are flying into the same mailbox. I’ve seen way too many missed alerts with this approach, and frankly, it’s unfair to ask operators not to miss anything.
I once saw a group that had about 2500 e-mails daily coming into their team mailbox. The mailbox contained user requests, team responses, valid alerts and false-positive alerts. No wonder they couldn’t keep up with it!
A better approach is to use graphical dashboard views. Dashboards consolidate an entire system into a bird’s-eye view and put its health at the operator’s fingertips. Colors should stay simple: red for critical alerts, amber for warnings and green for OK. For very time-sensitive systems, just two colors, red and green, is also fine (warning thresholds tend to get ignored anyway). ITRS and CA APM (Introscope) provide dashboarding capability, and so does Nagios.
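As a rough illustration of the idea (not any particular tool’s API), here is how check results like the ones sketched earlier might roll up into red/amber/green cells. The component names and the keyword-based severity rule are assumptions of mine, purely for illustration.

```python
# Roll individual check results up into simple dashboard cells.
# Severity rules and component names are illustrative, not tied to any specific tool.
GREEN, AMBER, RED = "green", "amber", "red"

def cell_colour(alerts, critical_keywords=("not running", "No transactions")):
    """Map a list of alert strings to a single dashboard colour."""
    if not alerts:
        return GREEN
    if any(k in a for a in alerts for k in critical_keywords):
        return RED
    return AMBER

def build_dashboard(check_results):
    """check_results: {"Machine": [...], "Application": [...], "Business": [...]}"""
    return {component: cell_colour(alerts) for component, alerts in check_results.items()}

# Example: one glance tells the operator where to look first.
dashboard = build_dashboard({
    "Machine": [],
    "Application": ["order_engine is not running"],
    "Business": ["Slowest transaction took 7.2s, SLA is 5s"],
})
print(dashboard)   # {'Machine': 'green', 'Application': 'red', 'Business': 'amber'}
```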
Dashboards should start with the end in mind. For example, I always like teams to start from the system architecture diagram. Once they’ve drawn the diagram in the tool, they work backwards to determine which alerts will give them the necessary information about each component they’re interested in monitoring.
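One simple way to keep yourself honest while working backwards is to list every box on the diagram alongside the alerts that cover it, then flag the boxes with no coverage. The component and check names below are invented for illustration:

```python
# Components straight off the architecture diagram, mapped to the alerts that cover them.
# All names here are hypothetical examples.
coverage = {
    "Web tier":         ["http_smoke_test", "process_up_webserver"],
    "Order engine":     ["process_up_order_engine", "order_latency_sla"],
    "Market data feed": ["feed_heartbeat"],
    "Database":         [],   # shared infra team watches the server, but nothing checks our schema yet
}

uncovered = [component for component, alerts in coverage.items() if not alerts]
if uncovered:
    print("No alert coverage for:", ", ".join(uncovered))
```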
Another suggestion I’ll make is that Prod Support teams should continuously work on cleaning up false positives as part of their day-to-day. False positives eventually make a team ineffective: once operators can’t tell the good alerts from the bad, everything starts getting ignored.