What you’re monitoring says a lot about your IT efficiency. By taking a closer look at the tools and processes used by your IT and Dev teams, you can evaluate your organization’s monitoring maturity.
Knowing where your IT organization sits in its monitoring maturity can help drive better financial decisions, and provide a top-level idea of what areas to focus on to improve.
The five stages of monitoring maturity are: provisional, diagnostic, integrated, intelligent, and predictive. Use this guide to identify where your team sits and how to overcome common challenges to get a more extensible, more observable IT environment.
Provisional Stage Overview
A Provisional Stage organization has monitoring tools – possibly lots of them. But they work in silos and often involve manual work. In many cases, the tools they have inherited came with other purchases – enterprise applications or hardware that has its own limited monitoring built-in.
Provisional Stage Ops organizations find themselves troubleshooting with limited context. This is frustrating for the engineers who find themselves responding to issues and fielding unnecessary alerts in the middle of the night – or worse, ignoring critical alerts as a result of frequent alert storms. Limited context slows the ability to troubleshoot issues efficiently.
Visibility across infrastructure is simply not possible with so many different tools. Teams are forced to operate with different data sets and different alerting systems. Correlating issues to get to the root of the problem is like playing a game of telephone in a different language, leading to finger-pointing and discord across and within teams.
Signs You’re in the Provisional Stage
Cloud Agility – You don’t use the cloud, or it’s not integrated with other implementations.
Architecture – Your IT architecture is centralized and on-prem.
Tech Stack – You have tools that work in silos and are often implemented manually.
Teams and Development – ITOps consists of taking tickets and fixing critical errors, with little time for org-wide improvements. Alert storms for issues big and small are a regular occurrence. DevOps is non-existent, so software development follows a waterfall approach.
Common Challenges for Provisional Stage IT Organizations
Alert Storms and Finger-pointing – Many organizations in the Provisional Stage find themselves with monitoring that is siloed by technology vendors. Without a centralized alerting system, disparate tools are firing off their own alerts. This creates alert storms. A lack of visibility across systems leaves teams without the context required to separate the signal from the noise. And because teams are only alerted to the fact that a problem is occurring, not where the problem is occurring, time is wasted finger-pointing instead of problem-solving.
Difficulty Making Strategic Decisions – Your IT organization is responsible for the health of an infrastructure that is growing in both size and complexity. Yet because performance data is isolated within tools in the Provisional Stage, a higher-order view of historical performance and system-wide trends becomes impossible. This complicates long-term, strategic decision-making.
Purchase decisions, for example, are difficult to make without a holistic approach to forecasting, and justifying these decisions to management can be an exercise in frustration without the data to back them up.
IT is Viewed as a Cost Center – Reporting on system-wide performance and uptime is difficult, if not completely impossible, because the data you need is housed within separate technologies. Without the ability to demonstrate the value your teams are delivering to the business, IT is perceived as a cost center, which can lead to budget cuts and being left out of strategic decision-making.
Strategies to Improve for Provisional Stage IT Organizations
Reduce Monitoring Silos – Get rid of as many vendor-provided monitoring tools as possible in favor of one or more tools that support multi-vendor environments. Siloed tools won’t provide the end-to-end visibility and context you need. When you consolidate tools into a unified view, all of a sudden, teams are working with similar data sets and viewing the same dashboards. Root cause is more easily discovered, so teams spend less time pointing fingers and more time resolving issues.
More Granular Monitoring – This can be as simple as increasing the polling frequency for services and devices from every five minutes to every minute, which increases both the amount and the granularity of data recorded. With finer-grained data, problem resolution becomes more effective, as do historical tracking and prediction.
Monitor Service Performance – Many organizations are moving away from a device-centric view of IT toward a service-centric one as their environments become hybrid and more complex. By simply defining the components of a service (i.e., all the devices that contribute to the performance of a service) and the KPIs you want to report on, dashboards can bring data together and report on service-related performance. Executives and upper management can easily digest these types of dashboard views, allowing strategic decisions for the business to be made more confidently.
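To make the Monitor Service Performance strategy concrete, here is a minimal Python sketch of rolling device-level samples up into service-level KPIs. Everything in it is an illustrative assumption rather than any particular product's API: the `DeviceSample` record, the `SERVICES` mapping, the device names, and the choice of KPIs. Availability here simply means the share of a service's devices that answered their last poll.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class DeviceSample:
    device: str        # device name (hypothetical)
    up: bool           # did the device respond to the last poll?
    latency_ms: float  # response time observed in that poll


# Hypothetical service definition: each service lists the devices behind it.
SERVICES: Dict[str, List[str]] = {
    "checkout": ["web-01", "web-02", "db-01"],
}


def service_kpis(service: str, samples: List[DeviceSample]) -> Dict[str, float]:
    """Aggregate per-device samples into KPIs for one service."""
    members = set(SERVICES[service])
    # Ignore samples from devices that do not belong to this service.
    relevant = [s for s in samples if s.device in members]
    if not relevant:
        raise ValueError(f"no samples for service {service!r}")
    return {
        # Share of the service's sampled devices that are currently up.
        "availability_pct": 100.0 * sum(s.up for s in relevant) / len(relevant),
        "avg_latency_ms": mean(s.latency_ms for s in relevant),
        "worst_latency_ms": max(s.latency_ms for s in relevant),
    }
```

Calling `service_kpis("checkout", samples)` with the latest poll results returns one small dictionary per service, which is the kind of figure a management dashboard can show in place of dozens of device graphs.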
Diagnostic Stage Overview
Organizations in this stage have divested themselves of the technology-specific, point monitoring solutions they inherited as the business grew. This is a critical step on the path toward more efficient operations.
Rather than deal with the firefighting that comes with point solutions, organizations in the Diagnostic Stage have made the wise decision to purchase dedicated monitoring tools that cover a number of technologies. As a result, there has likely been an improvement in mean time to resolution (MTTR), better performance against SLAs, and happier teams.
But IT organizations at this stage are not as efficient as they could be. Monitoring is likely still housed within multiple, independent systems. The main challenge that Ops teams face in this scenario is the need to consult multiple tools and multiple data sets for troubleshooting. It’s incredibly difficult to identify the root cause when relevant data needs to be manually correlated across multiple systems. The result is what we not-so-lovingly call ‘chair swivel.’
Signs You’re in the Diagnostic Stage
Cloud Agility – You’re in the cloud, but with individual SaaS or IaaS services.
Architecture – You support on-prem virtualization, and have some elements of your architecture in the cloud.
Tech Stack – You have tools that can keep track of most or all of your tech stack, but they aren’t unified, and some are favored heavily over others.
Teams and Development – Your ITOps team likely has a few tribal knowledge experts, who can be relied on to keep everything working. Development is agile.
Common Challenges for Diagnostic Stage IT Organizations
Institutional Knowledge and Expert Burn-out – When you have multiple monitoring tools, each of them requires specific training – no single person has the time or inclination to learn and manage all of the tools. As the number of tools increases, so does specific institutional knowledge within the engineering team(s). Engineers find themselves becoming proficient at some tools and not others, which causes various initiatives to be dependent on the participation of specific experts and also leads to burnout.
Chair Swivel – Issues will invariably arise that require correlation across devices that are monitored by different tools. Correlating data across devices that are monitored separately is a manual exercise, which makes it exceedingly difficult to pinpoint root cause efficiently.
Communicating IT Business Value – You are charged with maintaining a healthy infrastructure. Ensuring uptime and troubleshooting issues requires granular monitoring at the device level, but translating that into terms that teams outside of IT understand is nearly impossible. Enter service-level reporting. When you can abstract monitoring to a level higher than device performance, you can start talking about service performance and SLAs. This is not possible with independent monitoring tools, since there’s no easy way to correlate data between the subsystems (monitored by different tools) that underpin mission-critical applications and services.
Strategies to Improve for Diagnostic Stage IT Organizations
Evaluate Your Existing Toolset – Determine which tools provide the most visibility across different data sets. You’ll want to consolidate disparate tools into the most comprehensive option. The ability to monitor infrastructure end-to-end (in hybrid environments, spanning public cloud, remote data centers, and on-premises) with a single pane of glass means your team would be able to monitor and report on the health of services and mission-critical applications in addition to the underlying infrastructure, ensuring optimal performance and minimizing downtime.
Set Thresholds – Establish thresholds across device groups and services to route alerts to different teams based on severity, device, technology, groups, or even time of day. This will reduce noise and make the communication of alerts more precise by ensuring they get to the best person/team to address the issue.
Identify the Most Compelling Integrations – Begin to implement automation for ticketing systems, proactive resource allocation, and failure analysis.
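The Set Thresholds strategy above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the threshold values, the device groups, the business-hours window, and the destination names (`-oncall`, `-team`, `ticket-queue`) are all hypothetical, and a real monitoring platform would supply its own routing configuration.

```python
from dataclasses import dataclass
from datetime import time
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Alert:
    device_group: str
    metric: str
    value: float
    at: time  # local time of day when the alert fired


# Hypothetical thresholds per (device group, metric): (warning, critical).
THRESHOLDS: Dict[Tuple[str, str], Tuple[float, float]] = {
    ("database", "cpu_pct"): (80.0, 95.0),
    ("network", "packet_loss_pct"): (1.0, 5.0),
}

# Assumed business hours; the end time is exclusive.
BUSINESS_HOURS = (time(9, 0), time(17, 0))


def severity(alert: Alert) -> Optional[str]:
    """Classify an alert against its group's thresholds."""
    warning, critical = THRESHOLDS[(alert.device_group, alert.metric)]
    if alert.value >= critical:
        return "critical"
    if alert.value >= warning:
        return "warning"
    return None  # below every threshold: not worth a notification


def route(alert: Alert) -> Optional[str]:
    """Pick a destination based on severity, device group, and time of day."""
    level = severity(alert)
    if level is None:
        return None
    if level == "critical":
        # Critical issues always page the owning group's on-call engineer.
        return f"{alert.device_group}-oncall"
    in_hours = BUSINESS_HOURS[0] <= alert.at < BUSINESS_HOURS[1]
    # Warnings go to the team during the day and wait in a queue overnight.
    return f"{alert.device_group}-team" if in_hours else "ticket-queue"
```

The design choice worth noting is that a warning at 3 a.m. lands in a queue rather than waking someone up, which is one simple way to cut noise without losing the alert.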
Integrated Stage Overview
Integrated Stage organizations have moved beyond firefighting caused by team and technology silos, and typically enjoy end-to-end visibility of their environments. This is made possible thanks to the difficult but impactful steps organizations have taken to consolidate multiple monitoring tools into a single monitoring system.
In the Integrated Stage, an IT organization can visualize a significant portion of infrastructure – network, compute, and storage – in one unified view. As a result, troubleshooting and reporting capabilities have improved because they originate from a single data pool.
Integrated Stage organizations usually embrace a DevOps culture and framework to optimize business agility. At this level, it’s imperative to monitor the entire development pipeline – from code to release. To do this, operations and development teams leverage automation tools like Chef, Puppet, and Ansible, which enable more collaboration, faster application deployment, and more immediate troubleshooting.
When these capabilities are fully leveraged, the IT team is positioned to be more strategic, using freed-up time and resources to pursue projects that propel the business forward. Even better – and this is critical – everyone from executives to engineers has a clear view of the health of the business using dashboards and reports. When that happens, IT is no longer viewed as a cost center, but as a critical driver of the business.
IT teams in the Integrated Stage are on the cusp of adopting machine learning into their workflows; however, in the absence of predictive capabilities, much of the heavy lifting is still manual. There are a few challenges holding teams back in this stage.
Signs You’re in the Integrated Stage
Cloud Agility – IT is normally extended internally into the cloud.
Architecture – Your cloud and on-prem architectures work alongside each other, providing room to grow and opening up expansion possibilities.
Tech Stack – Your tools are used to coordinate and monitor your entire environment, and all work together.
Teams and Development – Your ITOps team is often able to automate normal tasks, allowing room for proactive development. Development is done with a DevOps mindset.
Common Challenges for Integrated Stage IT Organizations
Manual Correlation Requirements – Without machine learning, manual effort is required to detect patterns in the behavior of resources/devices and correlate them to issues. Context is key, but hard to develop manually, leading to less efficient issue resolution. AI – or machine learning – is more efficient at detecting patterns of behavior over time so that alerting becomes more intelligent and Ops managers can be more effective at root cause analysis.
Manual Forecasting and Capacity Planning – Though your team has a unified view into a significant portion of the overall infrastructure, there’s most likely a need to aggregate data from a number of sources to adequately forecast and plan capacity.
Strategies to Improve for Integrated Stage IT Organizations
Look Forward – Leverage your monitoring tool to show historical data and graph projections. Forecast when alert thresholds will be crossed, allowing for planned growth with no downtime.
Look Back – Implement proactive tracking of event history to allow for analysis of trending failures; this allows for dynamic thresholds that more accurately reflect ‘normal’ without creating unnecessary alerts.
Customize and Elevate Monitoring to the Service Level – Expand monitoring capability to support dynamic, mission-critical services so that you can report on the health of the service, rather than having to extrapolate it from many underlying resources and devices that underpin the service. Set KPIs to include custom metrics and devices specific to your mission.
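The Look Forward and Look Back strategies above lend themselves to a small worked example. The Python sketch below makes two simplifying assumptions that are ours, not a description of how any monitoring tool does it: forecasting uses a plain least-squares straight line over evenly spaced daily samples, and the dynamic threshold is a band of the mean plus or minus `k` sample standard deviations. The function names are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Optional, Tuple


def days_until_threshold(history: List[float], threshold: float) -> Optional[float]:
    """Look Forward: fit a straight line to daily samples (oldest first) and
    estimate how many days after the last sample it reaches the threshold.
    Returns None when the trend is flat or decreasing."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(history)
    # Ordinary least-squares slope and intercept.
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, history)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    if slope <= 0:
        return None
    intercept = y_bar - slope * x_bar
    crossing_day = (threshold - intercept) / slope
    # Clamp to zero if the fitted line has already crossed the threshold.
    return max(0.0, crossing_day - (n - 1))


def dynamic_band(history: List[float], k: float = 3.0) -> Tuple[float, float]:
    """Look Back: derive a 'normal' range from history instead of a fixed limit."""
    centre, spread = mean(history), stdev(history)
    return centre - k * spread, centre + k * spread


def is_anomalous(value: float, history: List[float], k: float = 3.0) -> bool:
    """Flag a value only when it falls outside the learned normal range."""
    low, high = dynamic_band(history, k)
    return value < low or value > high
```

For example, disk usage of 50, 52, 54, 56, and 58 percent over five days grows by two points a day, so `days_until_threshold` estimates 11 days until an 80 percent threshold is reached. Real workloads are rarely this linear, so treat the result as a planning hint rather than a guarantee.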
Achieving Full-stack Observability
Intelligent Stage Overview
Organizations in the Intelligent Stage use machine learning to extract essential insights from large pools of operational data. Those insights are used to teach IT systems, enabling them to constantly and independently improve their ability to discern root causes.
What’s the end state? Systems become opinionated over time by learning faults and fixing patterns within the infrastructure. In essence, the system becomes like an operator’s personal concierge.
Signs You’re in the Intelligent Stage
Cloud Agility – Hybrid IT is achieved through multiple cloud deployments and virtualization, allowing for seamless scalability.
Architecture – Containers and microservice architecture allow you to scale quickly while providing flexibility to seamlessly get the most out of multiple environments.
Tech Stack – Your tools go beyond monitoring your tech stack and achieve observability by aggregating devices and resources together, and by alerting beyond the device level according to degree of importance.
Teams and Development – IT issues are rare, and the process for fixing issues is stable and done in a timely manner, allowing for more time to pre-empt problems and improve. The lines between ITOps, DevOps, and DevSecOps blur together, creating a robust development process.
Intelligent Stage Machine Learning
In this stage, machine learning addresses several challenges experienced by organizations that reside in the preceding stages. It can:
Understand and Prevent Outage Scenarios – Examinations of problematic data are used to develop an understanding of how specific scenarios lead to brownouts or outages. The system becomes capable of alerting Ops teams to potentially dangerous or malicious information without performing an in-depth analysis every time. This dramatically speeds remediation and prevents imminent outages before they happen.
Automate Root Cause Analysis and Remediation – Effectively, the burden to intervene, identify root cause, and remediate an issue is shifted from the human to the system. This is significant for issue resolution, especially in heterogeneous environments that span on-prem, remote data centers, and/or public clouds.
Intelligent Stage organizations achieve more sophisticated visibility into their infrastructure by integrating a best-of-breed toolset across logging, application performance, and monitoring, which results in true end-to-end visibility for IT. Organizations in this stage move beyond just identifying issues to a proactive resolution of those issues. In other words, they inevitably get to a place where the systems recommend the best course of action to resolve and ultimately prevent issues.
The benefits of proactive resolution are legion. Internally, costs decrease dramatically because fewer ITOps tools are required for daily operations, and thus fewer man-hours are required to maintain them. In this scenario, the IT team makes a dramatic shift from reactive to truly strategic. Externally, the benefit is happier customers, made possible by greater service availability and reliability.
Opportunities to Innovate
There are two specific areas that can accelerate your organization’s ability to ascend to fully automated IT ops:
Integrate Monitoring into Additional Business Management Systems – This leads to significant cost optimization because it allows remediation options to be correlated with their associated costs. Operations teams can then optimize and filter for the lowest-cost solution, the most expedient solution, and so on.
Use Multivariate Analysis to Enhance Machine Learning – Multivariate analysis will speed a system’s ability to identify and predict issues, and automate the most relevant remediation.
Here are the key steps to innovate further:
Discover – Direct historical monitoring data to machine learning and predictive analysis engines to allow automated discovery of issues.
Respond – Automate processes for self-healing infrastructures based on predicted failures.
Manage – Implement proactive service-level monitoring, allowing for redundant resource allocation, failover automation, and ultimately ‘lights out’ IT infrastructure management.
Predictive Stage Overview
In its final form, the Predictive Stage organization’s basic infrastructure monitoring and management is fully automated, allowing IT teams’ focus to shift to strategic business initiatives. This is where the full vision of digital transformation as customer-driven and technology-empowered becomes a reality.
This stage describes automated IT operations: through AIOps, the monitoring system monitors, analyzes, and repairs most issues on its own and surfaces fewer than 20% of issues to the user. The issues that do get raised are only those for which remediation is not yet automated.
To reach this state of hyper-efficiency, an organization must position itself to take advantage of the vast amounts of historical data required to teach systems how to identify and ultimately resolve issues without user input.
That means an organization needs tools with access to this type of data, along with rules for training self-healing algorithms. Better still is a hybrid-capable SaaS infrastructure monitoring platform that can constantly learn and improve over time from multi-tenant datasets.
Signs You’re in the Predictive Stage
Cloud Agility – Not only are you fully hybrid, but cloud decisions are made based on business value.
Architecture – Next-gen architecture is able to grow flawlessly.
Tech Stack – Full visibility is available to every aspect of your tech stack, and machine learning is implemented to detect anomalies long before they become problems.
Teams and Development – Your Ops team works closely with the business itself, dictating what’s possible based on the technology available. DevSecOps is practiced at scale.
A self-healing platform exercises maturity in all three stages of a modern monitoring system: monitor, analyze, and act. Self-healing maturity begins with automating simple infrastructure tasks based on rules to proactively fix problems within the infrastructure. This signals an important move away from a reliance on alerts. These rules are derived from one or more of the following sources:
Organizational Knowledge Base – This includes heuristic rules that operators can create using the metrics, events, and transactions collected by the monitoring system. These rules are typically documented as runbooks and are triggered automatically when a specific condition occurs in the infrastructure and is detected in the data.
Dynamically Derived Scopes – Dynamically derived scopes are learned patterns within the collected data that depict normal versus abnormal conditions at any point in time across the entire infrastructure stack. These conditions trigger automatic actions which are configured using runbooks or IT policies.
Intelligently Learned Patterns – Learned from historical patterns across multi-tenant environments, the self-healing system performs actions based on certain conditions happening within the infrastructure. The self-healing system uses sophisticated AI techniques like “reinforcement learning” to learn automation from historical actions performed across datasets from multiple customers.
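The simplest of these three sources, operator-written rules documented as runbooks, can be illustrated with a short Python sketch. It is a toy model under our own assumptions: the `Runbook` type, the metric names, the limits, and the actions (which only return a description instead of touching real infrastructure) are all hypothetical, and the learned-pattern sources are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Runbook:
    name: str
    condition: Callable[[Metrics], bool]  # when should this runbook fire?
    action: Callable[[Metrics], str]      # stand-in for the real remediation


def run_matching(metrics: Metrics, runbooks: List[Runbook]) -> List[str]:
    """Run every runbook whose condition holds for the collected metrics
    and return a log line per remediation performed."""
    results = []
    for runbook in runbooks:
        if runbook.condition(metrics):
            results.append(f"{runbook.name}: {runbook.action(metrics)}")
    return results


# Hypothetical rules an operator might write from the knowledge base.
RUNBOOKS = [
    Runbook(
        name="free-disk-space",
        condition=lambda m: m.get("disk_used_pct", 0.0) > 90.0,
        action=lambda m: "rotated old logs",
    ),
    Runbook(
        name="scale-out-workers",
        condition=lambda m: m.get("queue_depth", 0.0) > 1000.0,
        action=lambda m: "added one worker",
    ),
]
```

Feeding each new set of metrics through `run_matching(metrics, RUNBOOKS)` fixes the conditions that already have a rule. An empty result means either that nothing is wrong or that no rule covers the situation yet, and the second case is what still needs to be surfaced to an operator.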
How LogicMonitor Can Help You Reach the Predictive Stage
At LogicMonitor, we have embarked on a journey to deliver the best AIOps product in the world. It aligns with our customers’ mission of automating infrastructure from the network to the cloud.
LogicMonitor provides AIOps functionality to:
Intelligently reduce alert noise by preventing alert storms and invalid alerts from occurring.
Make remaining alerts more meaningful and actionable by providing correlations and metadata to identify the root cause of issues more quickly, and provide more intelligent context around events and alerts.
Avoid issues before they affect service by providing forecasting using machine learning to spot predictive patterns, alerting on anomalies, rate of change, performance trends, etc.
Enable sophisticated infrastructure showback and cost analysis attributed to the business unit or by service/application.
LogicMonitor is the most extensible hybrid monitoring platform and delivers comprehensive AIOps capabilities and visibility for complete infrastructure automation. If you aren’t already using LogicMonitor, visit us at logicmonitor.com for a free trial and let us show you how to automate ITOps so you and your team can focus on innovation and revenue generation.