Rudolf Emil Kalman, born in Hungary in 1930, is regarded as the creator of several fundamental concepts in systems theory. His work on the structure of engineering systems helped shape modern control theory, the use of mathematics to steer a system’s behavior based on its inputs and outputs, and it introduced the concept of observability. Observability is a measure of how well the internal state of a system can be inferred purely by examining its outputs.
It’s clear from this snippet of history that observability is not a new concept or a tech buzzword. The software industry has adopted it to describe how three key forms of telemetry (the three pillars of observability), along with other factors, are used to observe systems and identify issues within them.
- What Is Observability?
- What Are the Three Pillars of Observability?
- Other Considerations to Reach Observability
- Why Is Observability So Important, and Why Are So Many Companies Pushing for Observability?
- Going Beyond the Three Pillars of Observability
- The Main Benefits of Observability
- The Main Challenges in Achieving Observability
- How Do You Build an Observable Future?
- How Close Are You to Full Observability?
What Is Observability?
A system is observable if its current internal state can be determined from its external outputs alone. In technology terms, observability (often abbreviated o11y) is the ability to tell what’s going on within the complex systems, processes, and microservices of an entire tech stack or application purely from the data streams it already emits.
A system that is not observable requires additional coding and services to assess and analyze what’s going on. When issues arise, this can be anything from inconvenient to disastrous, particularly if those system issues are causing downtime or a poor user experience (UX).
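O11y is a numeronym for observability (the 11 stands for the eleven letters between the o and the y), and the two terms are used interchangeably. That said, the abbreviation wasn’t coined by Kalman, and it generally refers to the software and monitoring sense of observability rather than the control-theory one.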
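For contrast, the control-theory notion that Kalman introduced has a precise mathematical test. The following is the standard textbook statement for a linear time-invariant system; the symbols $A$, $B$, $C$, $x$, $u$, $y$, and $n$ are conventional notation and do not come from this article.

```latex
\begin{aligned}
\dot{x}(t) &= A\,x(t) + B\,u(t), \qquad y(t) = C\,x(t), \qquad x(t) \in \mathbb{R}^{n} \\
\mathcal{O} &= \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{n-1} \end{bmatrix},
\qquad (A, C) \text{ is observable} \iff \operatorname{rank}\,\mathcal{O} = n
\end{aligned}
```

Here $x$ is the internal state and $y$ is the output that can actually be measured. When the rank condition holds, the state can be reconstructed from the outputs (and known inputs) alone, which is the idea the software sense of the word borrows.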
Observability vs. Monitoring
Is observability the same as monitoring? No, although the ability to achieve observability may somewhat rely on effective monitoring tools and platforms. It could be said that effective monitoring tools augment the observability of a system.
Monitoring is an action, something someone does: they monitor the effectiveness or performance of a system, either manually or by using various forms of automation. Tools for monitoring collate and analyze data from a system or systems, providing insights and suggesting actions or adjustments where necessary. Some monitoring tools can provide basic but crucial information like alerting a system administrator when a system goes down, or conversely when it goes live again. Other monitoring tools might measure latency, traffic, or other aspects of a system that can affect the overall user experience. More advanced tools may link to dozens, hundreds, even thousands of data streams providing broad-ranging data and analysis for complex systems.
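As a rough illustration of the most basic kind of monitoring described above, the sketch below raises an alert whenever a system flips between up and down. The `HealthCheck` type, its fields, and the sample probe data are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class HealthCheck:
    """One probe result. Field names are hypothetical."""
    timestamp: int  # seconds since the start of the observation window
    healthy: bool   # True if the probe succeeded


def state_change_alerts(checks: Iterable[HealthCheck]) -> List[str]:
    """Return one message each time the system flips between up and down."""
    alerts: List[str] = []
    previous = None  # state is unknown before the first probe
    for check in checks:
        if previous is not None and check.healthy != previous:
            state = "UP" if check.healthy else "DOWN"
            alerts.append(f"t={check.timestamp}: system is {state}")
        previous = check.healthy
    return alerts


# Made-up probe results: the system goes down at t=60 and recovers at t=120
probes = [
    HealthCheck(0, True),
    HealthCheck(30, True),
    HealthCheck(60, False),
    HealthCheck(90, False),
    HealthCheck(120, True),
]
alerts = state_change_alerts(probes)
print(alerts)  # ['t=60: system is DOWN', 't=120: system is UP']
```

Note that this only reports that something changed. It says nothing about why, which is the gap observability is meant to close.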
Observability isn’t a form of monitoring. Rather, it’s a property of the overall system, something that can be considered as a whole in the same way that someone might talk about the functionality of the system. The observability of a system determines how well an observer can assess the system’s internal state simply by monitoring the outputs. If a developer can use the outputs of a system (for example, those provided by application performance monitoring, or APM) to accurately infer the holistic performance of the system, that’s observability.
To summarize, monitoring is not a synonym for observability or vice versa. It’s possible to monitor a system and still not achieve observability if the monitoring is ineffective or the outputs of the system don’t provide the right data. The right monitoring tools or platforms help a system achieve observability.
What Are the Three Pillars of Observability?
A common way to discuss observability is to break it down into three types of telemetry: metrics, traces, and logs. These three data types are often referred to as the three pillars of observability. It’s important to remember that although these pillars are key to achieving observability, they are only the telemetry and not the end result.
Just like the captain’s log on a ship, logs in the technology and development world provide a written record of events within a system. Logs are time-stamped and may come in a variety of formats, including binary and plain text. There are also structured logs that combine text and metadata and are often easier to query. A log can be the quickest way to access what’s gone wrong within a system.
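To make the idea of a structured log concrete, here is a minimal sketch that emits each event as a JSON line combining text and metadata. The field names (`service`, `user_id`) and the sample events are assumptions made for illustration, not a standard schema.

```python
import json
from datetime import datetime, timezone


def structured_log(level: str, message: str, **fields) -> str:
    """Render one log event as a JSON line: free text plus queryable metadata."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **fields,  # arbitrary metadata, e.g. service name or user id
    }
    return json.dumps(record)


# Hypothetical events from a checkout flow
lines = [
    structured_log("INFO", "checkout started", service="cart", user_id=42),
    structured_log("ERROR", "payment declined", service="payments", user_id=42),
    structured_log("INFO", "checkout started", service="cart", user_id=7),
]

# Because every line is structured, a query is a field comparison, not a regex
parsed = [json.loads(line) for line in lines]
errors = [event for event in parsed if event["level"] == "ERROR"]
print(errors[0]["service"], errors[0]["message"])  # payments payment declined
```

The same events written as plain text would need pattern matching to answer even a simple question such as "which service logged errors for user 42?".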
Metrics are various values monitored over a period of time. Metrics may be key performance indicators (KPIs), CPU capacity, memory, or any other measurement of the health and performance of a system. Understanding fluctuations in performance over time helps IT teams better understand the user experience, which in turn helps them improve it.
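A metric, in this sense, is just a series of timestamped values. The sketch below averages a CPU metric over a window and flags the samples above a threshold. The sample values and the 90% threshold are invented for the example.

```python
from statistics import mean
from typing import List, Optional, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, CPU utilization in percent)

# Made-up samples taken once a minute
cpu_samples: List[Sample] = [
    (0, 35.0), (60, 40.0), (120, 92.0), (180, 95.0), (240, 38.0),
]


def window_average(samples: List[Sample], start: int, end: int) -> Optional[float]:
    """Average of the samples whose timestamp falls in [start, end)."""
    values = [value for ts, value in samples if start <= ts < end]
    return mean(values) if values else None


def samples_above(samples: List[Sample], threshold: float) -> List[int]:
    """Timestamps at which the metric exceeded the threshold."""
    return [ts for ts, value in samples if value > threshold]


overall = window_average(cpu_samples, 0, 300)
hot = samples_above(cpu_samples, 90.0)
print(overall, hot)  # 60.0 [120, 180]
```

The average alone (60%) looks healthy. It is the fluctuation over time, the two minutes above 90%, that points to a moment worth investigating.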
A trace is a way to record a user request right along its journey, from the user interface throughout the system, and back to the user when they receive confirmation that their request has been processed. Every operation performed upon the request is recorded as part of the trace. In a complex system, a single request may go through dozens of microservices. Each of these separate operations, or spans, contains crucial data that becomes a part of the trace. Traces are critical for identifying bottlenecks in systems or seeing where a process broke down.
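The relationship between a trace and its spans can be sketched with a small data structure. The `Span` fields below follow the general shape used by tracing systems, but the exact names, the service names, and the timings are assumptions for illustration. The bottleneck is taken to be the span with the most "self time", meaning its duration minus the time spent in its direct children.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span in one request
    span_id: str
    parent_id: Optional[str]  # None for the root span
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_time_ms(span: Span, spans: List[Span]) -> int:
    """Time spent in a span excluding its direct children.

    Assumes child spans run one after another and do not overlap.
    """
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(child.duration_ms for child in children)


def bottleneck(spans: List[Span]) -> Span:
    """The span where the request actually spent the most time."""
    return max(spans, key=lambda s: self_time_ms(s, spans))


# One hypothetical request passing through four services
trace = [
    Span("t1", "a", None, "frontend", 0, 500),
    Span("t1", "b", "a", "cart", 20, 120),
    Span("t1", "c", "a", "payments", 130, 480),
    Span("t1", "d", "c", "fraud-check", 140, 440),
]
slowest = bottleneck(trace)
print(slowest.service, self_time_ms(slowest, trace))  # fraud-check 300
```

The root span is the longest overall at 500 ms, but almost all of that time is spent waiting on downstream services. Looking at self time rather than total duration is one simple way to find where the delay originates.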
Other Considerations to Reach Observability
Even using this telemetry, observability isn’t guaranteed. But obtaining detailed metrics, logs, and traces is a great way to approach observability. There is some crossover between these different types of telemetry, especially in the data they provide.
For example, latency metrics may overlap with a set of traces on user requests, which also report latency but additionally show where in the system it occurs. That’s why it’s important to view observability as holistic: a view of the system as a whole, built from various types of telemetry.
Events are another type of telemetry that can be used to help achieve observability. Often, logs, metrics, and traces are used to provide a cohesive view of system events, so they can be considered as more detailed telemetry of the original data provided by various events within a system.
Dependencies or dependency maps give the viewer an understanding of how each component of a system relies on other components. This helps with resource management, as ITOps can clearly understand which applications and environments are using what IT resources within a system.
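A dependency map can be represented as a simple graph. The sketch below answers one question such a map is useful for: if a component fails, what else is affected? The component names and their relationships are hypothetical.

```python
from typing import Dict, List, Set

# component -> the components it depends on (made-up example system)
depends_on: Dict[str, List[str]] = {
    "web": ["api"],
    "api": ["auth", "orders-db"],
    "auth": ["users-db"],
    "reports": ["orders-db"],
    "orders-db": [],
    "users-db": [],
}


def impacted_by(failed: str, graph: Dict[str, List[str]]) -> Set[str]:
    """All components that depend on `failed`, directly or transitively."""
    # Invert the map: component -> the components that depend on it
    dependents: Dict[str, List[str]] = {name: [] for name in graph}
    for name, deps in graph.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Walk upward from the failed component
    impacted: Set[str] = set()
    stack = [failed]
    while stack:
        current = stack.pop()
        for parent in dependents.get(current, []):
            if parent not in impacted:
                impacted.add(parent)
                stack.append(parent)
    return impacted


print(sorted(impacted_by("users-db", depends_on)))   # ['api', 'auth', 'web']
print(sorted(impacted_by("orders-db", depends_on)))  # ['api', 'reports', 'web']
```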
Regardless of which exact types of telemetry are used, observability can only be achieved by combining various forms of data to create this “big picture” view. Any single one of the pillars of observability on its own provides very little value in terms of visibility and maintenance of a system.
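One simple way to picture this combining step is to join telemetry on a shared request identifier. The sketch below assumes every record carries a `trace_id` field, which is a simplification (in practice metrics are often aggregated and not tied to a single request), and all records are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List

logs = [
    {"trace_id": "t1", "level": "ERROR", "message": "payment declined"},
    {"trace_id": "t2", "level": "INFO", "message": "checkout complete"},
]
spans = [
    {"trace_id": "t1", "service": "payments", "duration_ms": 350},
    {"trace_id": "t1", "service": "cart", "duration_ms": 100},
    {"trace_id": "t2", "service": "cart", "duration_ms": 90},
]
metrics = [
    {"trace_id": "t1", "name": "request_latency_ms", "value": 500},
    {"trace_id": "t2", "name": "request_latency_ms", "value": 110},
]


def correlate(logs: List[dict], spans: List[dict], metrics: List[dict]) -> Dict[str, dict]:
    """Group all three kinds of telemetry by the request they belong to."""
    view: Dict[str, dict] = defaultdict(
        lambda: {"logs": [], "spans": [], "metrics": []}
    )
    for kind, records in (("logs", logs), ("spans", spans), ("metrics", metrics)):
        for record in records:
            view[record["trace_id"]][kind].append(record)
    return dict(view)


combined = correlate(logs, spans, metrics)
request = combined["t1"]
slowest_span = max(request["spans"], key=lambda s: s["duration_ms"])
print(request["metrics"][0]["value"], slowest_span["service"], request["logs"][0]["message"])
# 500 payments payment declined
```

Each source alone gives a fragment: the metric says the request was slow, the trace says where, and the log says why. Only the joined view tells the whole story.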
Why Is Observability So Important, and Why Are So Many Companies Pushing for Observability?
Although observability as a concept comes from the realm of engineering and control theory, it has been widely adopted by the tech world. Technology is advancing so quickly that developers are under pressure to constantly update and evolve systems, so it’s never been more crucial to understand what’s going on inside those rapidly changing systems.
Achieving observability empowers users to update and deploy software and apps safely without a negative impact on the end-user experience. In other words, o11y gives IT teams the confidence to innovate with their apps and software as needed.
Observability provides developers and operations teams with a far greater level of control over their systems. This is even more true for distributed systems, which are essentially collections of components, or even other, smaller systems, networked together. So many data streams come from these systems that manually collating the data would be impossible, which is why automation and advanced, ideally cloud-based, monitoring solutions are crucial for handling the sheer volume. However, to achieve observability, these data streams must be of high enough quality to provide deep visibility, so that questions about availability and performance can be answered and dealt with efficiently.
Modern development practices include continuous integration and continuous deployment (CI/CD), container-based environments built on tools like Docker and Kubernetes, serverless functions, and a range of agile development tools. APM-only solutions simply can’t deliver the real-time data and insights that teams need to keep today’s apps, services, and infrastructure up and running and relevant to a digitally demanding consumer base. Observability calls for high-fidelity, context-rich records of data that allow for deeper and more useful insights.
Going Beyond the Three Pillars of Observability
It’s important to note that the three pillars of observability by themselves don’t provide the holistic view that’s desirable when considering the internal states of systems. A better way to think about unified observability might be to cohesively combine the three pillars plus those other considerations mentioned above, to look at an overall, detailed image of the entire tech ecosystem rather than focusing on individual components.
Imagine one of those pictures made from multiple leaves of tracing paper. The first leaf holds the background. Lay down the next leaf, and you can see some trees, maybe some houses. The next leaf has some characters on it, while the final leaf has speech bubbles showing what everyone is saying and doing. Each leaf is an accurate part of the picture, but viewed alone it makes little sense because it’s completely out of context.
Putting all the components together creates a detailed picture that everyone can understand. That is effective observability, and it can only be achieved by carefully considering all the components as one holistic view of the system. Each data point is a form of telemetry, and observability is achieved by combining telemetry effectively. Observability-based solutions provide a platform and dashboard that allow users to tap into that detailed telemetry and maximize their development opportunities.
There’s a tendency for IT teams to believe they have to pay three different vendors to provide the three pillars of observability, which becomes a potentially costly exercise. Once again, we return to the very crucial point that observability is holistic, which means that having all this telemetry separately is not the key to achieving observability. The telemetry needs to work together, meaning that while the three pillars are a good starting point, they’re not an endpoint or a guarantee of an effectively maintainable system.
The Main Benefits of Observability
Of course, there’s no point focusing on improving a system’s observability unless there are measurable benefits, not just to the developer in terms of ease of system maintenance and improvement, but to the business as a whole.
The majority of decisions in any business are driven by concerns about cost or profit. The more costly an app or system is to develop, operate and update, the more that cost has to be potentially passed on to the consumer. That can reduce the apparent value of systems, so anything that keeps costs down and increases profits is welcome. ITOps and DevOps costs can be lowered by having a better and more intuitive understanding of the internal states of systems. Effective data streams paired with data analytics and the right monitoring platforms mean less manual observation is required, plus updates can happen faster and with more confidence. This reduces the number of employee hours required to keep a system at peak performance, plus faster, more effective updates increase the value of the system as a whole.
DevOps Can Focus On UX
When developers and operations teams aren’t having to spend hours diving deep into the internal workings of a system, they have more productive time available. This is time that can be spent creating better ways for users to engage with the metrics that matter and improving the overall user experience. Again, this has financial implications, as the better a system works, the more desirable it is to consumers. Even free-of-charge software and apps increase the value of a company when they work seamlessly, because of the increase in positive reputation of the development team in question.
Avoiding Downtime and Other Crises
Whether a business’s systems are completely internal or used to provide services to consumers, downtime can be devastating. It’s great to have contingency plans in place for when IT crises occur, but it’s even better to have an observability solution that allows systems to stay online for the majority of the time. Having a detailed yet holistic understanding of the state of a system allows changes and updates to be made promptly to deal with issues that may otherwise cause the system to go down entirely. Observability promotes stability, something consumers expect more and more from the apps and software they use daily.
Better Planning for the Future
Deep visibility into the internal workings of a system, combined with rich data and event analysis, doesn’t just help DevOps deal with what’s happening right now. It also helps them anticipate future events and plan accordingly. Understanding likely peak times and capacity fluctuations allows resources to be reallocated effectively, and it can alert DevOps to quieter times when testing new services might be more appropriate. Plus, having an observable system makes those tests more effective and less likely to cause a crash or downtime.
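As a toy version of this kind of planning, the sketch below averages historical request counts by hour of day to find the likely peak and the quietest window. The observations are invented, and a real forecast would need far more data and a proper model. This only illustrates the idea.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# (hour of day, requests served in that hour) across several days; made-up numbers
observations: List[Tuple[int, int]] = [
    (3, 150), (3, 130),
    (9, 1200), (9, 1300),
    (13, 2500), (13, 2700),
    (20, 1800), (20, 2000),
]


def hourly_averages(obs: List[Tuple[int, int]]) -> Dict[int, float]:
    """Average request count for each hour of the day."""
    buckets: Dict[int, List[int]] = defaultdict(list)
    for hour, count in obs:
        buckets[hour].append(count)
    return {hour: mean(counts) for hour, counts in buckets.items()}


averages = hourly_averages(observations)
peak_hour = max(averages, key=averages.get)    # when to have extra capacity ready
quiet_hour = min(averages, key=averages.get)   # a candidate window for testing
print(peak_hour, quiet_hour)  # 13 3
```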
The Main Challenges in Achieving Observability
We’ve already mentioned some of the challenges that occur when trying to achieve observability. A key one is becoming fixated on having the best metrics data, or the most complete traces, to the point of paying individual vendors to provide these. While this can be an effective solution for those companies willing and able to collate and combine this telemetry, it’s much closer to observability to have all this information in one place.
Another frequent challenge is avoiding fixation on the three pillars of observability and being willing to look beyond them, as explored earlier in the article. Other challenges include:
- Scalability – achieving the same level of observability no matter how software, apps, and systems grow (or shrink).
- Increasingly complex cloud environments.
- Dynamic containers and microservices.
- Increasing volumes and types of data and alerts from a variety of sources.
- DevOps and other teams using multiple monitoring or analytics tools that don’t necessarily sync with each other.
It may seem daunting for some companies to overcome these obstacles, but it is possible to achieve good observability by looking at sensible and efficient ways to deal with these challenges head-on.
How Do You Build an Observable Future?
Dealing with the scalability of observability means addressing some of those other challenges first. When your system deals with a range of cloud-based environments, it’s worth thinking about advanced monitoring tools that function either exclusively on the cloud or in a hybrid environment. Tools that are built to deal with the modern cloud are more likely to adapt to changes within cloud environments, giving users stability and consistency of data analysis.
Major cloud vendors, including Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, all support OpenTelemetry, often shortened to OTel. This project aims “to make high-quality telemetry a built-in feature of cloud-native software.” This is great news for DevOps teams investing in a cloud-based future for their apps and software. OTel bills itself as “an observability framework,” providing tools, SDKs, and APIs for generating, collecting, and exporting the telemetry used to analyze a system’s performance and behavior. The aim is to provide some level of observability regardless of which third-party vendors businesses choose and to provide low-code or no-code solutions for observability.
Another way to ensure scalability for observability solutions is to use the right type of datastore. Datastores and data warehouses need to scale and accommodate new data types as the volume and variety of data streaming from different sources increases. ETL (extract, transform, load) and ELT (extract, load, transform) solutions help bring data together in a single, usable format and into a single destination. Again, it’s about looking at how the system works as a whole and ensuring that every aspect of it can grow as the system or software does.
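The "single format, single destination" idea can be sketched as a tiny extract-transform-load step. The two source formats, their field names (`ts`, `msg`, `time`, `text`), and the in-memory list standing in for a datastore are all hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, List

# Extract: two hypothetical sources that describe events differently
service_a_records = [{"ts": 1704067200, "msg": "cache miss"}]  # epoch seconds
service_b_records = [{"time": "2024-01-01T01:00:30+01:00", "text": "db timeout"}]  # ISO 8601 with offset


def from_service_a(record: Dict) -> Dict:
    """Transform a service A record into the unified shape."""
    when = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
    return {"timestamp": when.isoformat(), "source": "service-a", "message": record["msg"]}


def from_service_b(record: Dict) -> Dict:
    """Transform a service B record into the unified shape, normalizing to UTC."""
    when = datetime.fromisoformat(record["time"]).astimezone(timezone.utc)
    return {"timestamp": when.isoformat(), "source": "service-b", "message": record["text"]}


# Load: a plain list stands in for the single destination
unified: List[Dict] = []
unified.extend(from_service_a(r) for r in service_a_records)
unified.extend(from_service_b(r) for r in service_b_records)

# With one format and one time zone, ordering events across sources is trivial
unified.sort(key=lambda event: event["timestamp"])
for event in unified:
    print(event["timestamp"], event["source"], event["message"])
```

In an ELT arrangement the raw records would be loaded first and transformed inside the destination, but the goal is the same: one queryable shape for data that arrives in many.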
How Close Are You to Full Observability?
Understanding how close you are to full observability revolves around thinking about the following questions:
- How easy is it for you to obtain key telemetry such as logs, traces, and metrics?
- What level of analysis do you get from this telemetry — i.e. how useful is it to you?
- Do you have to do additional coding and development to understand the internal states of your systems?
- Can you gain a big-picture analysis of your whole system in real-time?
If you answered “Very easy,” “Very detailed,” “No,” and “Yes,” in that order, then you might be close to achieving full observability within your systems and software. A sudden surge in your scaling shouldn’t be an issue, because your detailed analysis will project this and offer solutions that you can easily implement without draining the existing resources of the system. Problems with latency or infrastructure are easily identified thanks to effective traces combined with accurate logs and metrics, displayed effectively for you to address head-on. Downtime rarely happens, and when it does, it’s for the minimal time possible because of the detailed and cohesive view and understanding you have of the systems involved.
If you’re still finding that you can’t deal with issues like these, it may be worth examining the overall observability of your system and what tools or changes you need to make your system more resilient.
In summary, the three pillars of observability are important, but as sources of telemetry for achieving observability, not as the end goal. On top of this, you can use any other useful source of data to help you achieve observability. Complex systems rely on effective monitoring tools that are built with cloud-based environments in mind, but using these tools does not guarantee observability, because observability is a holistic property of the system itself. And finally, whatever observability solutions you invest in should be adaptable and scalable enough to grow with your business.