Developers are sometimes too helpful when they instrument their systems. For example, when asked to add a metric reporting the response time of a request – there are several ways it can be done. One way that seems to make sense is to keep one variable with the total number of requests, and another with the total processing time. The developer then creates a variable showing total processing time divided by total requests, and a way to expose it (an MBean to report it via JMX, a status page over HTTP, etc.). This yields a nice, neat object that reports the response time in milliseconds, all pre-calculated for the user.
The problem with this? It is indeed going to report the average response time – but it’s going to be the average of all response times since the server started. So… if the server has been running with an average response time of 1 ms, and it’s been up for 1000 hours, then it starts exhibiting a response time of 100 ms per request – after an hour of this slow behavior, the pre-calculated average response time will be about 1.1 milliseconds (assuming a constant rate of requests). Not even enough of a change to be discernible by eye on a graph, even though the system is now 100 times slower. We’ve seen exactly this implementation a few times (most recently by one of our own developers instrumenting some of our own processes, but also in ZooKeeper and other servers), and it makes the response time metrics useless for monitoring or alerting purposes. (And why else are they there?)
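To make the arithmetic concrete, here is a quick sketch using the hypothetical numbers from the scenario above (the request rate is an arbitrary assumption – it cancels out of the average anyway):

```java
// A server handles requests at a constant rate, averaging 1 ms per request
// for 1000 hours, then 100 ms per request for 1 hour. Watch how little the
// "average since startup" moves.
public class CumulativeAverageDemo {
    public static void main(String[] args) {
        double requestsPerHour = 3_600_000;        // assumed constant rate

        double fastHours = 1000, fastAvgMs = 1;    // 1000 hours at 1 ms
        double slowHours = 1, slowAvgMs = 100;     // then 1 hour at 100 ms

        double totalMs = fastHours * requestsPerHour * fastAvgMs
                       + slowHours * requestsPerHour * slowAvgMs;
        double totalRequests = (fastHours + slowHours) * requestsPerHour;

        // Prints about 1.10 ms, even though the server is now 100x slower.
        System.out.printf("average since startup: %.2f ms%n",
                totalMs / totalRequests);
    }
}
```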
Another choice? Pre-computing the metric (response time, CPU load, etc.) over a pre-defined time period (e.g. the average response time for all requests received over the last 2 minutes). While this is a better choice than averaging since the server started, it’s still not the most helpful. What if I want to sample at a different frequency? If I sample at one-minute intervals, I’ll get a rolling average of the last two minutes each time I sample, so my monitoring will not be as responsive to system changes as I would expect. If I sample at five-minute intervals, I’ll only be getting information about the prior two minutes. There is no information about the three minutes before that, so if there was a huge response time spike in those three minutes – I’ll be clueless.
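A small sketch of that sampling mismatch, with made-up per-minute numbers: the server precomputes a two-minute rolling average, and the monitoring system polls it every five minutes.

```java
// Hypothetical per-minute average response times (ms). A large spike hits
// in minutes 1-3, then traffic returns to normal.
public class RollingWindowDemo {
    public static void main(String[] args) {
        double[] perMinuteMs = {500, 500, 500, 1, 1, 1, 1, 1, 1, 1};

        // The server only ever reports the average of the last two minutes.
        // Polling at the end of minute 5 sees minutes 4-5; polling at the
        // end of minute 10 sees minutes 9-10.
        double sampleAtMinute5  = (perMinuteMs[3] + perMinuteMs[4]) / 2;
        double sampleAtMinute10 = (perMinuteMs[8] + perMinuteMs[9]) / 2;

        // Both samples report 1.0 ms; the 500 ms spike never shows up.
        System.out.println(sampleAtMinute5 + " " + sampleAtMinute10);
    }
}
```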
So what’s the preferred way for a developer to add instrumentation? Developers should do less work. The server should expose the metrics needed to calculate response time, etc, but not precompute anything – leave that to the monitoring systems. After all, that’s what they were designed to do.
So, to report response time – expose a counter that shows the total time spent processing all requests since server startup. Expose another counter that shows the total number of requests processed. That’s it. (Tomcat’s MBeans do exactly this.)
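A minimal sketch of the server side, with hypothetical names (this is not Tomcat’s actual MBean interface): just two monotonically increasing counters, nothing precomputed.

```java
import java.util.concurrent.atomic.AtomicLong;

// Two raw, monotonic counters are the entire instrumentation surface.
public class RequestMetrics {
    private final AtomicLong totalRequests = new AtomicLong();
    private final AtomicLong totalProcessingTimeMs = new AtomicLong();

    // Called once per request with the time it took to process.
    public void record(long elapsedMs) {
        totalRequests.incrementAndGet();
        totalProcessingTimeMs.addAndGet(elapsedMs);
    }

    // The monitoring system reads these two values and does its own math.
    public long getTotalRequests() { return totalRequests.get(); }
    public long getTotalProcessingTimeMs() { return totalProcessingTimeMs.get(); }
}
```

These two getters can be exposed through JMX, an HTTP status page, or whatever mechanism the server already uses.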
Then the monitoring system can figure out the response time for any interval it wants. The monitoring system will get the values of total processing time at time T1 and then again at time T2. Similarly, it will get the total requests at times T1 and T2. Then, no matter what the interval between T1 and T2 is, the average response time for that period is:
(Total Processing Time at time T2 – Total Processing Time at time T1)/(Total Requests at time T2 – Total Requests at time T1).
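That formula, as a monitoring system might implement it (the names here are illustrative):

```java
// Average response time over the interval [T1, T2], computed from two
// samples of the raw counters.
public class ResponseTime {
    public static double averageMs(long processingMsAtT1, long requestsAtT1,
                                   long processingMsAtT2, long requestsAtT2) {
        long requests = requestsAtT2 - requestsAtT1;
        if (requests == 0) {
            return 0.0; // no requests in the interval; nothing to average
        }
        return (double) (processingMsAtT2 - processingMsAtT1) / requests;
    }
}
```

For example, counter samples of (5,000 ms, 100 requests) at T1 and (11,000 ms, 200 requests) at T2 give an average of 60 ms for that interval – regardless of how long the interval was.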
This method means that if I choose to sample at a low frequency, such as every 5 minutes, and there is a large spike in the first 3 minutes of the sample – this will show up in the 5 minute average. I can sample as frequently as I like, and get data exactly representative of the period I am covering.
While this method may not seem as convenient as precomputing the values, it provides far more general utility. So, developers – quit trying to help so much. Your monitoring systems will thank you!