Your partial failures may be complete failures to your customers.

Can LogicMonitor, a company that develops datacenter and server monitoring, and that monitors everything in all sorts of ways, have an undetected customer-affecting issue? Yes. We just had an issue that was reported by some trial customers before our TechOps team was aware of it. Even worse, after TechOps thought they had addressed the issue, the customers were still affected. How?

In this case, the issue was that for some customers, UI performance sucked: pages took tens of seconds to render. Yet, confusingly, other customers whose UI was served by the same server, hitting the same MySQL servers, had great response times. (Lots of apologies to the accounts affected.)

Our monitoring didn’t detect anything amiss: CPU loads, application-specific stats, and MySQL stats such as query cache usage, number of transactions, and full table joins all looked normal. Even Tomcat response time per request stayed nice and fast for the servers.

So what was going on?

A fair bit of digging revealed that the same query on MySQL took 50 seconds in some accounts, while the exact same query in other accounts (different MySQL databases on the same MySQL engine) took only milliseconds. Exact same schemas. But a different query plan was generated for the slow databases than for the fast ones. Exact same query, exact same schema, orders of magnitude different performance. (And the slow query plan was generated for databases that were both smaller and larger than ones that got the correct query plan, so we’re unclear as to why the optimizer was getting it wrong sometimes.) Now, we can fix the fact that MySQL was optimizing the queries incorrectly (we’re switching to a STRAIGHT_JOIN to force the correct join order, and testing shows MySQL 5.6 does not have the issue), but how do we ensure this does not recur? (After all, we don’t want to make one of the top monitoring mistakes….)
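To illustrate the fix (the table and column names below are made up for the example, not our actual schema): STRAIGHT_JOIN tells MySQL to join the tables in the order they are written, instead of the order the optimizer would otherwise pick.

```sql
-- Ordinary JOIN: the optimizer chooses the join order, which is where the
-- slow databases were getting a bad plan.
SELECT d.value
FROM devices AS dev
JOIN datapoints AS d ON d.device_id = dev.id
WHERE dev.account_id = 42;

-- STRAIGHT_JOIN: force MySQL to read the tables left to right, pinning the
-- plan that we know performs well.
SELECT d.value
FROM devices AS dev
STRAIGHT_JOIN datapoints AS d ON d.device_id = dev.id
WHERE dev.account_id = 42;
```

This is a blunt instrument: it overrides the optimizer for that query everywhere, which is exactly what you want when the optimizer is demonstrably choosing wrong, and exactly what you don't want otherwise.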

The issue affected only a few accounts, so the average database, Tomcat, and other aggregated metrics were barely affected, and monitoring those did not help. (Note: had Tomcat, MySQL, etc. tracked the distribution of response times and other statistics, instead of just the average, then monitoring that distribution could have alerted us.)
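As a rough sketch of why averages hide this (the response_times table and its columns here are hypothetical, not our actual schema), the mean can look healthy while the tail tells the real story:

```sql
-- The average looks fine; the worst case and the fraction of slow requests
-- are what expose the handful of affected accounts.
SELECT
    AVG(duration_ms)                         AS avg_ms,       -- what aggregate monitoring effectively sees
    MAX(duration_ms)                         AS worst_ms,     -- what the affected customers actually saw
    SUM(duration_ms > 1000) / COUNT(*) * 100 AS pct_over_1s   -- share of requests slower than 1 second
FROM response_times
WHERE logged_at > NOW() - INTERVAL 1 HOUR;
```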

But MySQL does have a slow query log. We have it enabled on some servers, but it has been regarded as an optimization and troubleshooting tool, used to improve code or to help investigate once an issue has already been identified. Yet any time MySQL is taking seconds to complete a query, that is impacting one of our customers. Even if the aggregate response time is fast, if some customers are having poor performance, we need to know, and be alerted on that.
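For reference, a minimal sketch of turning the slow query log on at runtime; the threshold and file path here are illustrative, not our production settings:

```sql
-- Enable the slow query log dynamically (these settings persist only until
-- restart unless also placed in my.cnf).
SET GLOBAL slow_query_log      = 'ON';
SET GLOBAL long_query_time     = 1;                          -- log queries slower than 1 second
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';  -- path is an assumption
```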

Consequently, our TechOps team is now rolling out active monitoring of the slow query log by LogicMonitor to all production MySQL instances, alerting us any time the log contains queries that exceed the specified threshold.

The lesson? Monitor as much as you can, but you are still likely to have issues you do not adequately detect. Just be sure that when you do encounter such an issue, you adjust your monitoring to alert on it, so it cannot go undetected again. To paraphrase a saying: “Give me an undetected issue once, shame on you. Give me an undetected issue twice, shame on me.”