Why hasn’t AI taken off yet in monitoring?

Algorithmic IT Operations (or AIOps) has great potential to improve the state of complex enterprise monitoring. The addition of dependency data, load correlation, anomaly detection, and other systems will improve the efficacy of today’s monitoring systems substantially. However, true Artificial Intelligence may have less to contribute than other aspects of AIOps.

AI Defined?

There’s a lot of talk about artificial intelligence (AI) and deep learning taming the vast quantities of data that modern Operations teams and their tools deal with. Analyst reports frequently tout AI capabilities, however minor, as a strength of a product, and the lack of them as a weakness. Yet no effective use of AI seems to have emerged and claimed wide adoption in Network Operations or Server Monitoring. Why not?

Part of the issue is that AI is a soft definition. As Rodney Brooks, former director of MIT’s Artificial Intelligence Laboratory, says, “Every time we figure out a piece of it, it stops being magical; we say, ‘Oh, that’s just a computation.’” So by definition, we never really reach AI.

LogicMonitor has a feature that performs numerical correlation across vast amounts of data, looking for patterns and similarities to an identified oddity. If disk latency on a database increases, for example, it can identify which other metrics showed a similar pattern. Website requests? Network retransmissions? Database queries from a QA system? This narrows down the candidates for the root cause. We don’t think of such a system as AI, because it’s “just” the application of well-understood statistical methods. To an operations person from 20 years ago, however, it would definitely seem like intelligence.
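To make the idea concrete, here is a minimal sketch of that kind of correlation search, assuming a simple Pearson correlation and invented metric names and data; it is not LogicMonitor’s actual implementation, which operates at far larger scale.

```python
# Sketch: given a metric that has gone anomalous, rank other metrics by how
# closely their recent behavior tracks it. Data and metric names are invented.
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_candidates(anomalous_series, candidates):
    """Order candidate metrics by how strongly they move with the anomaly."""
    scored = [(name, pearson(anomalous_series, series)) for name, series in candidates.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)

# Disk latency on a database spiked; which other metrics showed a similar pattern?
disk_latency = [4, 5, 4, 6, 30, 42, 38, 35]
candidates = {
    "web.requests_per_sec":  [200, 210, 205, 220, 900, 1100, 1050, 980],
    "net.retransmits":       [1, 0, 2, 1, 1, 2, 1, 0],
    "qa.db_queries_per_sec": [50, 48, 52, 51, 49, 50, 53, 47],
}
for name, score in rank_candidates(disk_latency, candidates):
    print(f"{name}: r = {score:+.2f}")
```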

Machine Learning as AI

Some techniques are (for the moment, at least) universally recognized as AI. Machine learning is one of them, and it is finding applicability in all sorts of fields that were thought to be the province of “real” intelligence. It can beat humans at chess and at Go, and compose symphonies and haiku as good as those written by human composers and poets. So why not operational troubleshooting?
Well, one reason is that supervised machine learning needs to learn, and then follow, rules. It has to be trained on a set of data, such as completed games. From the training set, it generates a model, and uses that model to apply what it has learned to new games or compositions.
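As a minimal sketch of that workflow (assuming scikit-learn is available, and with invented incident features and labels), the model below can only reproduce whatever patterns exist in its training set:

```python
# Sketch of the supervised-learning loop: train on labeled historical examples,
# then apply the resulting model to new ones. Features and labels are invented.
from sklearn.tree import DecisionTreeClassifier

# Historical incidents: [cpu_utilization, error_rate, queue_depth] -> root-cause label.
training_features = [
    [0.95, 0.01, 10],
    [0.20, 0.30, 5],
    [0.30, 0.02, 500],
    [0.90, 0.02, 15],
]
training_labels = ["cpu_saturation", "bad_deploy", "downstream_backpressure", "cpu_saturation"]

model = DecisionTreeClassifier().fit(training_features, training_labels)

# A new incident is classified purely by resemblance to the training set.
print(model.predict([[0.92, 0.01, 12]]))  # -> ['cpu_saturation']
```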

Correct Result, Wrong Reason

One problem with supervised learning in the Ops world is that you can’t tell what rules the AI extracted. It may be getting the correct result, but for the wrong reasons.
The most fun example of this I know of is Clever Hans. Hans was a horse that could do arithmetic. Addition, subtraction, division. Ask the horse a question, and he would indicate when a person counting out loud had reached the correct answer. He would do this reliably, regularly, and repeatedly. The only problem was that the horse wasn’t solving mathematical problems and then listening for the correct number to be said. He was just looking at the people around him, and when their body language indicated the right moment, he’d stomp his foot. He was effectively a neural net that had been trained to give the answer to mathematical puzzles — but Hans was extracting the answers not in the way that was expected. And not in a way that was useful if you actually wanted to rely on Hans’ computational abilities and didn’t know the answer in advance.

Similarly, AI vision recognition systems can be trained to distinguish leopards from cheetahs from ocelots very accurately. But they also identify couches as leopards. (One could imagine problems if one were relying on these algorithms to defend against leopards…) You can check the work of an AI when you already know the answer, but relying on it when you don’t know the answer makes you depend on whatever cues the system deemed applicable from its training data, and you don’t know what those were. A device could be identified as the likely root cause because its hostname has more than the usual incidence of the letter “P”, an unnoticed artifact of the training data set.

I would acknowledge that the inability to explain your reasoning doesn’t matter if you get the right answer. Professional bike racers will be able to tell you exactly how fast they can go around a given corner before they lose traction and slide out. So could a physicist. The physicist can show his work, and explain the coefficient of friction, and the lateral and vertical vectors. The cyclist will not be able to explain how he knows — but he will know.

So the lack of insight into an AI’s processes isn’t necessarily an obstacle to its use in operations. Rather, the issue is that an AI is limited by its training set. An AI can write a symphony that we enjoy, because it conforms to our current expectations of what an artistic and pleasant symphony should sound like, but it cannot push the boundaries of creativity, writing symphonies that violate the rules and carry enough emotional impact to cause riots, only to be regarded as genteel art soon afterwards.

AI Insights on Monitoring

In IT Operations, good operational practices dictate that most issues are unique. Prior issues should have been addressed in a way that means they won’t recur, or, at the very least, the monitoring should have been configured so that it clearly warns of the situation. If neither of those conditions is true, the incident should still be regarded as open. So if the issues are mostly unique, the training sets will not have covered them, and the output of an AI is unlikely to be terribly insightful or helpful. It may, in fact, produce distractions and wild goose chases, like asking Clever Hans to calculate your tax return. (Of course, some known issues do recur; AI can certainly help provide context for such issues, so that less experienced staff can resolve them without relying on the greater context that more experienced staff may have.)

One way around this would be to mine information across a diverse set of customer operational data. The fact that I had an issue with ZooKeeper whose root cause was quorum configuration, and I resolved it, may mean that the issue shouldn’t recur for me, but that knowledge may be useful to other companies running ZooKeeper. Of course, there are problems here not only with data privacy but also with data labeling and training. If I identify my ZooKeeper nodes as z1.prod, with tags #zookeeper and #prod, and you call yours n34.lax.us.west, with tags #quorum and #live, how is commonality to be established so that the training lessons can be applied?
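A tiny sketch of that normalization problem, with invented alias tables and hostname conventions, hints at what establishing commonality would involve; real data is far messier:

```python
# Sketch of the labeling problem: before lessons learned at one company can be
# applied at another, differently named entities have to be mapped onto a shared
# vocabulary. The alias tables below are invented for illustration.
TECHNOLOGY_ALIASES = {
    "zookeeper": {"zookeeper", "zk", "quorum"},
    "kafka": {"kafka"},
}
ENVIRONMENT_ALIASES = {
    "production": {"prod", "live", "prd"},
    "qa": {"qa", "test", "staging"},
}

def canonicalize(hostname, tags):
    """Guess a (technology, environment) pair from a hostname and free-form tags."""
    tokens = {t.lower().lstrip("#") for t in tags}
    tokens |= set(hostname.lower().replace(".", " ").replace("-", " ").split())
    tech = next((canon for canon, aliases in TECHNOLOGY_ALIASES.items() if tokens & aliases), "unknown")
    env = next((canon for canon, aliases in ENVIRONMENT_ALIASES.items() if tokens & aliases), "unknown")
    return tech, env

print(canonicalize("z1.prod", ["#zookeeper", "#prod"]))       # ('zookeeper', 'production')
print(canonicalize("n34.lax.us.west", ["#quorum", "#live"]))  # ('zookeeper', 'production')
```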

Technology can certainly add value to monitoring systems now: it can cluster alerts from different systems together into single incidents, based on commonalities of time and, ideally, a knowledge of the topology and dependencies of the system. This can help reduce alert overload and simplify root cause detection. It is arguable whether this is AI, however. (Note that the ideal state of ‘knowing’ dependencies may not be possible for the system, at least at an individual transaction level, as dependencies can change with load; with whether data is cached or not; when new code is deployed; or when nodes or containers are added or removed. Note also that changes in dependencies that cause incidents may not be known by the monitoring system, as the communication mechanisms may themselves be disrupted by the incident, so the monitoring may be misidentifying some alerts as related to the incident based on old data. As the Google team puts it, “Few teams at Google maintain complex dependency hierarchies because our infrastructure has a steady rate of continuous refactoring.”)
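A rough sketch of that kind of clustering, with an invented alert structure and a hard-coded dependency graph, looks something like this; production systems weigh many more signals and must keep the topology current:

```python
# Sketch of time-plus-topology alert clustering: alerts that fire close together
# and touch connected resources are grouped into one incident. The alerts and
# dependency edges are invented for illustration.
from dataclasses import dataclass

@dataclass
class Alert:
    resource: str
    metric: str
    timestamp: float  # seconds since epoch

# Known dependency edges (who depends on whom); in practice this is discovered and can be stale.
DEPENDS_ON = {
    "web-1": {"db-1"},
    "web-2": {"db-1"},
    "db-1": set(),
    "batch-7": set(),
}

def related(a, b, window=120):
    """Two alerts belong together if they are close in time and topologically connected."""
    close_in_time = abs(a.timestamp - b.timestamp) <= window
    connected = (a.resource == b.resource
                 or b.resource in DEPENDS_ON.get(a.resource, set())
                 or a.resource in DEPENDS_ON.get(b.resource, set()))
    return close_in_time and connected

def cluster(alerts, window=120):
    """Greedy single-pass grouping of alerts into incidents."""
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for incident in incidents:
            if any(related(alert, existing, window) for existing in incident):
                incident.append(alert)
                break
        else:
            incidents.append([alert])
    return incidents

alerts = [
    Alert("db-1", "disk_latency", 1000),
    Alert("web-1", "response_time", 1040),
    Alert("web-2", "response_time", 1055),
    Alert("batch-7", "job_failed", 5000),
]
for i, incident in enumerate(cluster(alerts), 1):
    print(f"incident {i}: {[a.resource for a in incident]}")
```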

Big Picture AIOps

Looking at AIOps in a broader scope than strict AI does promise to improve monitoring substantially. AIOps is well suited to pointing out discrepancies and anomalies in performance over time. It can alert you to issues caused by releases: for example, that rendering a page used to take 2 database requests and now takes 10. Identifying deviations in performance between versions of code, and whether they are significant, is going to be an increasingly important role of monitoring as development agility increases. Of course, identifying “this is different than before” is not really AI; it’s statistical processing, done on a large scale, in a system informed by dependency and flow data. The value of AIOps is that it can use topology, information about releases, and dependency data to infer the most likely significant (root-cause-like) data changes. It can also apply machine learning, trained on what human administrators regard as significant. Given that AI is currently best suited to replacing human decisions that take less than a second (locating cars, recognizing faces), there aren’t many such tasks in monitoring a complex data center. However, identifying alerts as not meaningful, or as symptoms of some other issue, is one task that machine learning can be trained on to alleviate human workload.
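As an illustration of that “this is different than before” check (with invented thresholds and sample data, not any particular product’s logic), a release comparison can be as simple as:

```python
# Sketch of a release-comparison check: compare a metric's values before and
# after a deploy and flag a large, non-noise shift. Thresholds and data are
# illustrative; this is statistical processing, not "intelligence".
from statistics import mean, stdev

def release_regression(before, after, min_relative_change=0.25, min_sigma=3.0):
    """Flag the metric if the post-release mean moved by a large amount relative to prior spread."""
    m_before, m_after = mean(before), mean(after)
    spread = stdev(before) or 1e-9
    relative_change = abs(m_after - m_before) / max(abs(m_before), 1e-9)
    sigmas = abs(m_after - m_before) / spread
    return relative_change >= min_relative_change and sigmas >= min_sigma

# Database requests needed to render one page, sampled before and after a deploy.
before_deploy = [2, 2, 3, 2, 2, 3, 2]
after_deploy = [10, 9, 11, 10, 10, 12, 9]
print(release_regression(before_deploy, after_deploy))  # True: worth an alert
```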

This is the direction LogicMonitor plans on taking: to combine performance data, cost data, performance alerts, release data, topology, and dependencies to generate meaningful notifications, and then use machine learning to reduce that set even further to actionable notifications, with suggested (and possibly even automated) resolution.
Speaking as a former network and systems administrator who was often on call: the easier we can make life for the people on the front lines of monitoring and alert response, the better for everyone. And of course, that’s our whole focus at LogicMonitor.