Developing enterprise IT and software-driven consumer products is becoming more complicated by the day. The growing demand for rapid product upgrades requires streamlined performance and stability. To achieve this stability, companies need effective monitoring and observability practices.

Although monitoring and observability are intertwined, they are not interchangeable. They are designed to operate in a fluid cycle that leads to straightforward visibility. In an ever-evolving technical landscape, combining monitoring and observability efforts is crucial in achieving system visibility – a clear picture of where your system is and where it is supposed to be. This post provides a comprehensive breakdown of both practices, their purposes, their differences, and why combining them is necessary to produce a quality product.

What is monitoring?
What is observability?
How do observability and monitoring work together?
How does observability differ from monitoring?
How does monitoring fit into the larger observability ecosystem?
Does observability encompass monitoring?
What makes monitoring stand on its own?
What are the benefits of monitoring and observability?
- Monitoring
- Observability
Getting started with observability and monitoring
The bottom line

What is monitoring?

Monitoring involves using a monitoring system that looks out for common or known problems, alerting when predetermined thresholds are crossed. Primarily driven by time-series data, good monitoring achieves early warning by proactively identifying trends that might lead to problems. By analyzing and validating the changes in application performance, monitoring plays a significant role in driving product decisions. Monitoring also provides a good view of your system’s health and allows you to see the impact of failures and fixes.

What is observability?

Observability is a practice that focuses on monitoring and analyzing the outputs of applications and the infrastructure that they run on. When systems are identified, observability allows you to ask questions about the data and derive insights into how to improve system performance. And when issues arise, observability enables teams to triage, troubleshoot, and understand the state of the system and the reason behind those issues. Once you’ve identified those previously unknown issues, you can begin monitoring for those things to ensure such problems don’t reoccur.

The complexity of modern systems requires a way to convert application data into meaningful information, and observability enables developers to derive deeper meaning and smarter solutions from large volumes of data within a short time.

How do observability and monitoring work together?

Monitoring alerts when something is wrong, while observability endeavors to understand why. Although they each serve a different purpose, monitoring and observability significantly complement each other. In many respects, observability is a superset of core monitoring principles and technologies. An observable system is easier to monitor. Monitoring harvests quantitative data from a system through queries, traces, processing events, and errors. Monitoring is querying your system to get answers.

Once the data is available, observability transforms it into insights and analysis, which helps determine the required changes or metrics that need tracking. If you’re not sure what questions to ask when interrogating a system, this is where observability comes into play.

The data from the monitoring stack forms a basis of observability. The observability stack transforms the statistics and analytics into insights that are applied to the monitoring stack to improve the data that it generates. If the system development stack focuses on observability, it allows the developer to answer those questions in an advanced monitoring stack.

Since both processes keep evolving as technology changes, it is crucial to ensure that all the tools, stacks, and platforms related to monitoring and observability are constantly changing. A system’s observability depends on its simplicity and the capability of the monitoring tools to identify the right metrics. Therefore, observability is impossible without some level of monitoring.

How does observability differ from monitoring?

While observability and monitoring may seem similar, they differ in many ways. Understanding where monitoring and observability differ is essential if you don’t want unprecedented change to cause considerable problems in your development life cycle. Here are some of the differences:

Observability is a practice, while monitoring is driven by technology

Monitoring only tells you that a problem happened. Ideally, observability tells you what caused that problem. Monitoring is a process of collecting data from your system, while observability is the system’s ability to be transparent enough to easily track issues. While observability determines the standards a system should meet for analysis, monitoring utilizes those properties to produce the required data and statistics.

Observability is a superset of monitoring and encompasses many other practices

Observability utilizes logs, metrics, and traces, among other data, such as topology and events, to provide enough data from the system to help solve current and potential issues. Monitoring can also be part of the observability process, along with aggregating logs and tracking individual requests. A system can have low or high observability depending on its speed to solve issues. On the other hand, monitoring only collects data and alerts the developer whenever errors occur.

Observability identifies unknowns, while monitoring aims at reporting known errors

This is one of the main ways in which observability differs from monitoring. The core purpose of observability is to handle existing errors and solve incoming issues before users discover them. Monitoring, on the other hand, identifies the problems known by the user.

How does monitoring fit into the larger observability ecosystem?

As enterprises accelerate their initiatives in a software-centric world, staying safe is essential. Using data from their business outcomes, apps, and the underlying infrastructure, monitoring and observability create a string to pull on when problems arise. That string includes the different types of data previously mentioned – such as metrics, traces, and logs – which are the underpinnings of hybrid observability. Information derived from this data allow you to ask three important questions:

What’s going on in my environment?
Where is the problem happening?
Why is that problem happening?

Answering these questions is key to finding quality solutions as quickly as possible.

What’s going on in my environment?

By having metrics instrumented all over the environment, customers can clearly see when issues are brewing so they can act on them before they blow up. Metrics mostly appear as time-series data that’s plotted on a graph or chart. Metrics tell you whether you have a problem, but they don’t tell you the root cause. Common examples would be: a CPU at 100%, a disk that’s full, or a network link that’s dropping packets.

Where is the problem happening?

With complex systems having so many moving parts, it’s imperative that you find the right pieces to fix quickly. That’s done via traces. Traces not only provide insight into poorly performing services, they identify interactions between services that are performing poorly, which contributes to poor overall performance or availability. Traces help you identify which kinds of transactions or customers might be affected, and on which systems.

Why is the problem happening?

The final question to ask is why. That’s where logs come in. Logs contain all the unstructured data that reveal exactly what happened, when and why it happened, and the context and details required to build the best solution for that issue.

Answering these three questions – using metrics, traces, and logs that are in alignment with your infrastructure, applications, and business – will enable a dramatically faster resolution.

Does observability encompass monitoring?

It’s important to remember that observability isn’t just monitoring in more complex environments; it’s the fusion of varied types and sources of data to better determine why problems are happening, not simply that a problem is happening. In the past, software developers relied on traditional application monitoring to determine what goes on in a system. Their approach failed to achieve app reliability because of insufficient information from monitoring setups.

However, as systems become more complex and refactored into microservices, the sheer volume of monitoring data becomes impossible to correlate and navigate without help. This led to the adoption of observability practices that no longer limit monitoring to data collection and processing but also make systems observable. The observability platform, also known as the O11y, turns data into useful information by combining streams, performing analysis, and detailing the full life cycle.

A modern log management solution requires a platform that supports your stack of infrastructure, applications, and services. Rather than spend long hours pulling out many codes, analyzing errors, and generating messages without context, an observability platform relieves developers by improving their ability to ask the system questions and find answers. The result is better application performance, minimized downtime, and enhanced user satisfaction.

When designing a dependable system, there is a need for collaboration among cross-functional Devs, ITOps, and QA personnel. High demands from both the internal business units and the ultimate end users make it vital for large and small companies to identify and respond to errors fast and in real-time. Traditional monitoring that relies on error logs only scratches the surface. Observability goes beyond collecting error data and can now analyze that data, identify trends, gain insights, and understand the overall health of the system.

Observability can help developers:

Get more insights into the apps they develop
Automate testing
Release better quality code, faster

With observability, productivity improves. Now teams are freed up to optimize collaboration, build better relationships, and focus on innovative solutions and digital transformation. Better still, the end user benefits from a high-quality and refined user experience.

What makes monitoring stand on its own?

Monitoring plays a crucial role in building dashboards, alerting, and analyzing long-term trends. However, it is difficult to make predictions when monitoring complex distributed apps because production failures are not linear. Despite these limitations, monitoring remains a crucial tool for developing and operating microservice-based systems.

Monitoring metrics need to be straightforward and focused on actionable data to get a good view of the system’s state. Since it’s an excellent mechanism to know the current and historical state of a distributed system, monitoring should be considered in every stage of the software development life cycle. The monitoring data helps developers find issues, root-causing failures, debuggability, patterns analysis, and many more features useful for all teams. Monitoring provides the information, statistics, and alerts to trigger actions that will make those events visible to the interested parties.

What are the benefits of monitoring and observability?

Adopting both observability and monitoring benefits the developers and end users alike.

Monitoring

Here are just a few ways that monitoring solutions serve the business:

Save costs

Monitoring tools provide real-time alerts whenever issues arise in the system. This means that you can save money and avoid costly losses by resolving the problems as soon as they occur. Lack of a monitoring setup means organizations waste more resources and money in troubleshooting.

Reduce risk

Monitoring reduces the risk of system intrusions. Any attack attempts or suspicious activities are easy to detect because the system sends alerts to the monitoring teams. Then, teams can respond fast, neutralize the risk, and keep the system secure.

Increase productivity

Every organization wants to increase operational efficiency and productivity. Reinforcing DevOps teams with real-time insights and alerts makes it easy to isolate incident causes and fix them within the shortest time possible. With reduced downtime, teams are more productive and the end-user experience is better. Furthermore, the ability to detect underlying issues before they cause rippling problems can mitigate downtime. All of this allows teams to focus on more strategic initiatives and get the job done efficiently.

Enhance flexibility

Monitoring solutions are highly flexible. Unlike observability, most monitoring solutions are not embedded inside applications’ data source code. Therefore it is easy to switch between different solutions.

Observability

Observability is more about the relationship between data than the data itself. The goal is to focus more on outcomes and make sure that the same problems don’t occur over and over again. A proper observability setup delivers powerful benefits to IT teams, organizations, and end users alike. Some of them include:

Eliminate debugging

The observability platform offers constant surveillance to production applications, making it easier to track issues and ensure that collected and analyzed data can be correlated and used to preempt issues even before they arise.

Monitor health

Rather than merely logging numbers, observability helps determine the app’s health by providing insights into the applications’ future performance. Now you can develop an application while considering the changes required to improve overall health and performance and the metrics required to be tracked.

Build better apps

Observability is a fundamental property of an application and its supporting infrastructure. Since developers create an application to be observable, the DevSecOps teams can easily interpret the observable data during the software delivery life cycle to build better, more secure, more resilient applications.

Automate everything

Observability supports intelligent systems that can self-heal and recover from relatively minor issues without the need for human intervention. An advanced observability solution can also automate more processes, increasing efficiency and innovation among Ops and Apps teams.

Improve customer experience

A good user experience boosts the company’s reputation, grows revenue, and gives the company a high competitive edge by increasing customer satisfaction and retention.

Increase productivity

The infrastructure teams can leverage observability solutions to reduce the time to identify and resolve issues, improve application uptime and performance, and optimize cloud resource utilization.

Getting started with observability and monitoring

As IT environments become more complicated, achieving hybrid observability doesn’t have to be. Managing distributed system infrastructures requires an equally effective set of tools to monitor, analyze, and trace events. These tools allow developers to understand system behaviors and prevent future system problems. The following steps are crucial in implementing observability:

Choose a centralized observability platform

A good monitoring solution is the bedrock of an excellent observability setup. Developers, site reliability teams, and project managers must look for a monitoring solution that suits their needs and unifies data from all the telemetry into one location, explored through a single interface. The solution should be flexible enough to factor in the requirements’ growth as the platform keeps evolving. When done well, developers and operations teams can turn off point tools that have been creating data sprawl and excessive administration effort.

Analyze the metrics of your application

In this step, teams need to analyze the metrics carefully, considering all the metrics of their applications. Missed metrics can make the troubleshooting time-consuming because they might represent an inaccurate trend. Choosing metrics carefully also eliminates redundant data, which might not provide helpful insights and analysis. The monitoring setup is as efficient as the speed at which you can access the log information about errors.

Respond to the trends in the statistics

After setting up a centralized monitoring system and receiving useful logs, it is essential to determine what to do with the data. This can be done through machine learning, where algorithms convert the accumulated data into valuable insights, allowing you to do more in less time. Machine learning helps analyze what is going on in a system and raise an alert before the issues tamper.

The bottom line

Although observability and monitoring work in tandem, it is essential to know that they differ. Observability is a practice that includes monitoring. The choice of either depends on the size of your application, reliability, and goals. Either way, both solutions are necessary to keep crucial systems and applications up and available, enabling a business to innovate and move forward.

We recently sat down with Jude Bakeer, one of LogicMonitor’s Solutions Engineers, to talk about the future of IT and Observability. Part of Jude’s role requires her to talk to customers and enterprises every day. Over the years, she’s gathered unparalleled insights into key trends across these industries and segments – from ops teams to C-level executives. In her day-to-day experiences, Jude sees Observability up close and personal and uses feedback from customers and prospects to continuously improve.

While Jude didn’t have a crystal ball on hand, we did get into what her current ecosystem looks like and what we can expect moving forward into a hybrid world. Jude gave us a look into the market as she sees it, along with what she thinks could be the roadmap for a better future.

LogicMonitor: Thanks for joining us today. So Jude, what’s on your mind regarding the state of IT today?

Jude Bakeer: There are so many buzzwords right now: “online presence” (thanks COVID), more “sophisticated workflows,” “modernization,” “cloud-native,” “ephemeral,” “containerization,” “ITOM,” “synergy,” etc. But I’d boil it down to a single word: “change.” IT is like shifting water; if you embrace the changes, you’ll flow right along. What we thought was impossible just five years ago is now our everyday. Having an open mind to the continual evolution happening around us is critical. And right now, that big shift is Observability.

LM: So what exactly is Observability, and how do you achieve it?

JB: The IT landscape is built on various technologies and vendors. But things change. Teams adopt new processes or tech, and it’s out with the old and in with the new. Observability gives us a method for finding the answers to our questions. Observability measures how well internal states of a system can be inferred from knowledge of its external outputs – technological, mechanical, anatomical, or something else entirely.

A graph showing the path to achieving true o11y as a cyclical process between context, correlation, and intelligence.

To observe the performance of our IT environments appropriately, we’ll need answers that span across these layers. Those answers are delivered through data that we can correlate, process, and contextualize. We’re taking what’s displayed outwards to determine what’s internally happening.

Why’s this important? We make better decisions when we have good data. Full-stack Observability is the KEY to gaining control and visibility of systems as the landscape evolves.

LM: What do you mean by “full stack” Observability?

JB: There are three main types of data required for this view: metrics, traces, and logs.

Some metrics involve tracking CPU, memory, and disk performance. This helps make decisions regarding processor speed or RAM, but modern-day DevOps need access to other metrics. Through application performance monitoring (APM), metrics like page load times and error rates can be incorporated for maximum Observability.

Gain better visibility through metrics.
-Vast breadth o coverage and support or increasingly complex environments.
-View overall performance and availability of the application workflow
-Identify breakdowns and bottlenecks within apps and infrastructure
-Correlate data across layers of your tech stack

Next, we’ve got Traces. A trace paints a clear picture of individual events within the system (known as spans) but also showcases the interactions between parent and child cases.

And finally, we’ve got the logs (flat log files), which can be generated through programming languages and framework libraries based on the running code. By aggregating these logs, you can send and store them to be accessed later, saving time and making the use of logs more effective for debugging.

LM: Great explanation. So how can businesses get started with Observability? What’s your advice?

JB: Plan accordingly. Navigating the ever-changing infrastructure landscape, answering questions, and being confident in your ability to provide insight through Observability will make your team leaner and meaner. Here’s a good outline to follow:

Q1: To achieve Observability, each pillar we talked about should be utilized to improve performance. When all three pieces of telemetry work together, they should provide immense insight that answers queries like, “why is this feature not working?” or “what went wrong during the latest update?”

Q2: Is the telemetry making a difference? If you’re using the three pillars of Observability and aren’t receiving answers or insight into the system’s health, the problem may be with your level of analysis. Deep analyses are required for the most beneficial results in an observable system. Also, be sure that the data collected is relevant to the tasks.

Q3: Real-time data is a critical component of Observability. If you’re not receiving up-to-date information, take measures to improve this as part of your journey to implementing a more observable system. When real-time data is used in an observable system, it provides immense value and aids in troubleshooting or improving performance.

Q4: One of the biggest advantages of having an observable system is that it isn’t necessary to perform additional coding or development to delve into the status of your systems. If you must put in more work to understand functionality or data, you haven’t quite reached Observability.

LM: What are some challenges that might make businesses turn to an Observability solution?

JB: It all comes down to not having the right monitoring tools or platform for your teams to achieve Observability. Achieving an actual state of Observability means that we must avoid common pitfalls when s*** hits the fan. Things like tool sprawl, I’m talking swivel-chair, emailing multiple teams, waiting on responses, writing and re-writing emails because of miscommunication, and then running the right reports on components contributing to an application’s health. And then there’s limited visibility – even if we’ve got the data for the reports! It boils down to asking: do we have the ingredients to bake this cake, or do we gamble on a substitution?

How long is it taking to get answers from the multiple tools and the half-built reports? When something takes time like that, it means we’re losing money. And then there’s struggling to scale, which keeps businesses limited to legacy technologies, and everything is stuck in the same gear.

Lacking o11y
common pitfalls:
-Tool sprawl = difficulty correlating data
-Limited visibility = difficulty viewing data
-Laborious troubleshooting = $ loss to downtime
-Struggling to scale = limiting business growth + making way for obsolescence

LM: What is the main thing you advocate for in terms of Observability?

JB: Even some of the most advanced monitoring tools still provide a limited view of an app or software’s overall health and functionality. This is often because data can be inaccurate, or the data collected may be irrelevant. Observability offers improved visibility so that developers and businesses can make well-informed decisions. A traditional monitoring tool may show an error on an event log. You’d be made aware that there was an error, but there would be little information about why this error occurred. With Observability, you’d receive information from the event log along with metrics and traces that would point you in the direction of “why.” This allows DevOps or ITOps teams to better understand a system and prevent similar situations.

LM: Be predictive when you can.

JB: Exactly! When Observability is in place, developers experience less stress. They can identify any issues in real-time, focusing on fixing them instead of wasting countless hours identifying a problem. The result is less unplanned downtime. This enhances customer satisfaction and boosts the overall quality of an app or system.

When software development and IT operation teams work with observable systems, they spend less time troubleshooting and remedying hang-ups. This enables them to focus on UX, which leads to a more profitable and streamlined app or software. Organizations that adopt agile environments and prioritize Observability are in a prime position for effective scaling. An observable system allows a company to operate more without sacrificing security.

LM: This is great. Thanks so much. So where can I go to learn more?

JB: Thanks for having me! It has been a great chat. If you’re curious to learn more about Observability or my thoughts on the industry, let’s connect. You can follow me on Linkedin here.

How many times in your life have you heard the saying, “Time is money”? Time is valuable, and it’s one of the scarcest resources we have.

Calculating the business value of IT productivity and making the broader organization aware of the opportunity costs associated with certain IT activities is an essential part of any modern CIO’s role. Businesses cannot launch new products and services if the majority of their ITOps or developers’ time is spent fighting fires and troubleshooting. Mean time to resolution (MTTR) matters because MTTR is a leading indicator of how much “innovation drag” a business is experiencing. Time saved on routine IT maintenance/troubleshooting can translate into more time for the CIO and their skilled technical team to innovate.

Strategically Invest in Technology

When the pandemic first occurred in the U.S. in March 2020, many CIOs received what amounted to blanket approval overnight to lift and shift to the cloud. Legacy systems that had been in place for 20 years suddenly weren’t accessible from home, and so they were replaced by more modern SaaS-based solutions.

However, transforming into a pandemic-ready organization overnight wasn’t free. And from recent discussions with analysts, journalists, and IT leaders, it has become clear that the metaphorical IT budget belt is tightening again. More pressure than ever is being placed on CIOs and CTOs to identify specific and measurable business benefits from the technology investments they chose to make during the height of the pandemic.

For those that are doing the math, strategic investments in technology mean time saved that ends up paying off, big time. Paying for software and services to help transform your organization into a more future-proof and resilient organization might result in an initial upfront cost, but will leave your organization with a reduced total cost of ownership in the long run.

3 Things to Consider as Your Organization Looks Ahead to Its Technology and IT Needs in 2022 and Beyond

Align IT tool and budget asks with key business initiatives for expedited C-Suite approval. Every business today needs more performance from their IT, yet many business leaders refuse to dedicate more resources to their tech stack and team. This mismatch often stems from IT making specific technical demands that seem disconnected from the business’s strategic priorities. To secure funding, IT teams need to learn the language of the business and demonstrate ROI in a way that makes sense to business leaders who may not be familiar with certain technologies or terms. If CIOs are able to present a scorecard that shows the impact certain IT investments have made on overall business productivity or the employee and customer experience, for example, additional IT requests become much more palatable.

Invest in solutions that free up engineering time before committing IT to innovation initiatives. IT can and should be a source of innovation for the business. However, IT teams often have to spend full days fighting fires or assembling manual reports. What would these teams become capable of if they had free time available each day? Every minute spent restoring a system to its baseline is time that could instead be spent pushing the envelope to transform how an organization does something, or what’s possible in the future.

Assign a cost to time savings and build it into vendor evaluation criteria. Time is the most important resource we have as both businesses and individuals. Yet not enough effort and attention is spent on measuring the impact of time spent and saved within an organization. Count how many outages your organization experiences every year. How many engineering hours does it take to locate the source of an outage and get everything back up and running? How much revenue is lost while your systems are down? Internal resources are not “free”, and outages may be more costly than you realize. If a vendor observability platform comes with AIOps and automation capabilities designed to prevent outages or surface anomalies that may cause outages in the future, what are the cost savings associated with those features?

To truly understand and maximize the time value of technology, achieving observability across the entire tech stack is a priority. The most successful modern CIOs are those who have achieved observability and know how to measure, prove results, and pass those findings on to the business. Here at LogicMonitor, we make software that provides CIOs with the observability they need to prove ROI and drive better collaboration with the business.

In this blog series, we share the application instrumentation steps for distributed tracing with OpenTelemetry standards across multiple languages. Earlier, we covered Java Application Manual Instrumentation for Distributed Traces, Golang Application Instrumentation for Distributed Traces, Flask Application Manual Instrumentation for Distributed Traces, and DotNet Application Instrumentation for Distributed Traces.

In this blog post, we are going to cover:

The value of OpenTelemetry and Tracing
Instrumenting a Ruby application using LogicMonitor
Next steps
External resources

The Value of OpenTelemetry and Tracing

Use this guide to write a Ruby application emitting traces using the OpenTelemetry Protocol (OTLP) Specification. All you need is a basic understanding of developing applications in Ruby. New to OpenTelemetry? Read more about it.

OpenTelemetry for Ruby can be used to add automatic and manual instrumentation to your applications. Automatic instrumentation is enabled by adding instrumentation packages. Manual instrumentation can be added using the OpenTelemetry API.

Installation

Prerequisites

These instructions will explain how to set up automatic and manual instrumentation for a Ruby service. In order to get started, all you will need:

Ruby installed (2.5+)

In this guide, we will auto-instrument a simple “Hello World” application using Sinatra, a Ruby Domain-specific language (DSL) for creating web apps with minimal effort. Follow along with an existing Ruby project of your own.

Initialize the New Project

mkdir sintara-hello-world
cd sintara-hello-world
vi Gemfile

GemFile

source "https://rubygems.org"
gem "sinatra"

hello.rb

require 'rubygems'
require 'bundle/setup'
    get '/frank-says' do
      'Put this in your pipe & smoke it!'
    end
    get '/hello' do
      'sintara says hello!'
    end
    get '/' do
      'exisitng paths: /hello, /frank-says, /logicmonitor!'
    end
    get '/logicmonitor' do
      'Hello from Logicmonitor!'
    end

Run the Sinatra Server

bundler install
ruby hello.rb
> == Sinatra (v2.1.0) has taken the stage on 4567 for development with backup from Thin
> 2021-11-16 11:44:02 +0530 Thin web server (v1.8.1 codename Infinite Smoothie)
> 2021-11-16 11:44:02 +0530 Maximum connections set to 1024
> 2021-11-16 11:44:02 +0530 Listening on localhost:4567, CTRL+C to stop

The Sinatra server should be up and running. Let’s instrument it to export traces to LogicMonitor’s APM portal.

Install the Open Telemetry Client Gems

The first step is to add the following gems to your Gemfile:

Gemfile

gem 'opentelemetry-sdk'
gem 'opentelemetry-exporter-otlp'
gem 'opentelemetry-instrumentation-all'

Including opentelemetry-instrumentation-all provides instrumentations for several frameworks such as Rails, Sinatra, and database drivers and HTTP libraries.

Initialization

It’s best to initialize OpenTelemetry as early as possible in your application lifecycle. We will add this in hello.rb

OpenTelemetry initialization:

hello.rb

require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'
require 'opentelemetry/instrumentation/all'
 
OpenTelemetry::SDK.configure do |c|
  c.service_name = 'sintara-hello-world'
  c.use_all() # enables all instrumentation!
end

Now that you have set up your application to perform tracing, we’ll need to configure the SDK to export the traces to the LogicMonitor APM portal. We use the OTLP exporter, which the SDK tries to use by default. Next, we’ll use LogicMonitor’s OpenTelemetry Collector to receive these traces and visualize them in the LM APM portal.

Putting It Together

Gemfile

source "http://rubygems.org"
 
gem 'sinatra'
gem "opentelemetry-api"
gem "opentelemetry-sdk"
gem "opentelemetry-exporter-otlp"
gem 'opentelemetry-instrumentation-all'

hello.rb

require 'rubygems'
require 'bundler/setup'
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'
Bundler.require
OpenTelemetry::SDK.configure do |c|
    c.service_name = 'ruby-otlp'
    c.use_all
   end
get '/frank-says' do
      'Put this in your pipe & smoke it!'
    end
    get '/hello' do
      'sintara says hello!'
    end
    get '/' do
      'exisitng paths: /hello, /frank-says, /logicmonitor!'
    end
    get '/logicmonitor' do
      'Hello from Logicmonitor!'
    end

cd sinatra-hello-world
gem install opentelemetry-instrumentation-all
gem install opentelemetry-sdk
gem install opentelemetry-exporter-otlp
bundler install

Run the Application

ruby hello.rb
[2021-11-16 16:11:32] INFO  WEBrick 1.6.0
[2021-11-16 16:11:32] INFO  ruby 2.7.1 (2020-03-31) [x86_64-darwin19]
== Sinatra (v2.1.0) has taken the stage on 4567 for development with backup from WEBrick
[2021-11-16 16:11:32] INFO  WEBrick::HTTPServer#start: pid=3794 port=4567

Exporting Traces

Note: These steps require having access to Logicmonitor’s APM license.

Install the LM Otel collector using docker.

docker run -d -e LOGICMONITOR_ACCOUNT=<account> -e  LOGICMONITOR_BEARER_TOKEN= <bearer token> e LOGICMONITOR_OTEL_NAME="<collector name>" -p 4317:4317 -p 4318:4318 logicmonitor/lmotel:latest

Next, we’ll have to let the SDK know where the collector endpoint receives traces. Set the OTEL_EXPORTER_OTLP_ENDPOINT environment variable to http://0.0.0.0:4318:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://0.0.0.0:4318

Now, let’s test it out. Start your application and perform a few operations to generate tracing data, e.g. navigate your web app or kick off background tasks.

Navigate to Traces on the LogicMonitor portal and search for traces related to your service. These were generated via OpenTelemy auto-instrumentation!

Success! Viewing LogicMonitor Traces

You can view additional information by clicking on a trace.

Viewing additional trace information within the LogicMonitor platform.

Here is what a Constructed Trace looks like:

Resource labels:
-> service.name: STRING(ruby-otlp)
-> process.pid: INT(7614)
-> process.command: STRING(helloworld.rb)
-> process.runtime.name: STRING(ruby)
-> process.runtime.version: STRING(2.7.1)
-> process.runtime.description: STRING(ruby 2.7.1p83 (2020-03-31 revision a0c7c23c9c) [x86_64-darwin19])
-> telemetry.sdk.name: STRING(opentelemetry)
-> telemetry.sdk.language: STRING(ruby)
-> telemetry.sdk.version: STRING(1.0.1)
InstrumentationLibrarySpans #0
InstrumentationLibrary OpenTelemetry::Instrumentation::Sinatra 0.19.2
Span #0
Trace ID : e4e6e6c683f2cbb8ab5404a6604b7c01
Parent ID :
ID : cadc573256df7294
Name : /frank-says
Kind : SPAN_KIND_SERVER
Start time : 2021-11-16 07:32:30.577101 +0000 UTC
End time : 2021-11-16 07:32:30.579635 +0000 UTC
Status code : STATUS_CODE_ERROR
Status message :
Attributes:
-> http.method: STRING(GET)
-> http.url: STRING(/frank-says)
-> http.status_code: INT(404)

For manual instrumentation, please refer to official OpenTelemetry docs.

Next Steps

Congratulations, you’ve just written a Ruby application emitting traces using the OpenTelemetry Protocol (OTLP) Specification. You can use this code as a reference when you start instrumenting your business application with OTLP specifications.

Now, you’ll have visibility into your application to address any potential bottlenecks present in your application code. If you’re already using LogicMonitor to monitor your infrastructure or collect logs, you’ll be able to associate these traces for faster troubleshooting.

Check back for more blogs setting up distributed tracing with OpenTelemetry standards across other application languages.

External Resources

We recommend the following resources to learn more about OpenTelemetry and CNCF:

Join the CNCF Slack community
Check out OTel documentation

In this blog series, we share the application instrumentation steps for distributed tracing with OpenTelemetry standards across multiple languages. Earlier, we covered Java Application Manual Instrumentation for Distributed Traces, Golang Application Instrumentation for Distributed Traces, Node JS Application for Distributed Traces, and DotNet Application Instrumentation for Distributed Traces.

In this blog post, we are going to cover:

Value of OpenTelemetry with tracing
Instrumenting a Python Flask application using LogicMonitor
Next steps
External resources

OpenTelemetry is a project by the Cloud Native Computing Foundation aimed to standardize the way that application telemetry data is recorded and utilized by platforms downstream. This application trace data can be valuable for application owners to understand the relationship between the components and services in their code, the request volume and latency introduced in each step, and ultimately where the bottlenecks are that are resulting in poor user experience. Python is one of the many languages that OpenTelemetry supports, and Flask is a popular lightweight framework that is used to create web applications. Below we will cover the steps involved to instrument a basic Flask application.

You can read more about OpenTelemetry here.

Installation

Prerequisites:

Python installed (3.7+).
Your selected IDE (like PyCharm)
Once you have the IDE, choose your project (or create a new one if you’re just following along). I named mine “instrument-flask-app”.
virtualenv — if you’re using PyCharm you don’t have to worry about this.

Initialize the Project:

Install the following libraries:

pip install flask
pip install opentelemetry-api
pip install opentelemetry-sdk
pip install opentelemetry-opentelemetry-instrumentation-flask
pip install opentelemetry-exporter-otlp

Create a file “app.py” under the root project directory “instrument-flask-app”:

Instrument-flask-app
	|___ app.py

Import the following libraries:

from flask import Flask  			
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider		
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.trace.export import BatchSpanProcessor,          ConsoleSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import   OTLPSpanExporter

Create a Flask app instance:

app = Flask(__name__)

Construct trace provider:

trace.set_tracer_provider(TracerProvider())

Init span exporter:

The exporter is the component in SDK responsible for exporting the telemetry signal (trace) out of the application to a remote backend, log to a file, stream to stdout. etc. In this example, we are creating a gRPC exporter to send out traces to an OpenTelemetry receiver backend running on localhost.

trace.get_tracer_provider().add_span_processor(BatchSpanProcessor
(OTLPSpanExporter(endpoint=os.environ.get("LM_OTEL_ENDPOINT"),insecure=True)))

Note: Optionally, you can also print the traces emitted from the application on the console by doing the following:

trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

Set up the OTEL collector endpoint:

You will have to set the environment variable LMOTEL_ENDPOINT which is the endpoint of the OTEL collector. The traces would be emitted by following:

export  LM_OTEL_ENDPOINT=http://<HOSTNAME>:<PORT>

Create resource detector:

The resource describes the object that generated the Telemetry signals. Essentially, it must be the name of the service or application. In LogicMonitor, you can use these attributes to map to a device or resource that you’re already monitoring within LogicMonitor.

Service.namespace: Used to group the services. For example, you can use this to distinguish services across environments like QA, UAT, PROD.
Service.name: The logical name of the service.
Host.name: Name of the host where the service is running.

Set the environment variable as the following:

export OTEL_RESOURCE_ATTRIBUTES=service.namespace=opentelemetry,service.name=instrument-flask-app,host.name=localhost

Auto-instrumenting the Flask app object:

FlaskInstrumentor().instrument_app(app)

Define an endpoint to test the instrumentation:

@app.route("/hello") 
def hello():  
   return "Hello World!"

Running the Flask app:

Flask apps run on port 5000 by default.

if __name__ == "__main__":  	# on running python app.py
        app.run()  			# run the flask app

Putting it together:

# app.py

from flask import Flask
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter importOTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor,ConsoleSpanExporter

app = Flask(__name__) 
trace.set_tracer_provider(TracerProvider())

# To print the traces on the console
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))

# To send the traces to the configured OTEL collector
trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(
OTLPSpanExporter(endpoint=os.environ.get("LM_OTEL_ENDPOINT"), insecure=True)))

@app.route("/hello")  
def hello():  
       return "Hello World!"  


if __name__ == "__main__":  
   app.run()

Test our application

Let’s test our application by running the Python app. You should the server started as below:

python app.py

Success!

You should be able to see the traces on the LogicMonitor portal after hitting the endpoint (“http://127.0.0.1:5000/hello”) from your browser:

You can view additional information by clicking on the trace:

Additional information of a trace in LogicMonitor.

The traces on the console would look like this:

Next Steps

Congratulations, you have instrumented a flask-based Python application emitting traces using the OpenTelemetry Protocol (OTLP) Specification! Now, you’ll have visibility into your application and be able to address any potential bottlenecks present in your application code. If you’re already using LogicMonitor to monitor your infrastructure or collect logs, you’ll be able to associate these traces for faster troubleshooting.

If you’re not already sending in logs, try it out here to further enrich how much context you can bring into LogicMonitor.

External Resources

We recommend the following resources to learn more about OpenTelemetry and CNCF:

Join their Slack community here.
Check out their OTel documentation here.
Read more or leave feedback in the OTel Python repo here.

Check back for more blogs covering steps for distributed tracing with OpenTelemetry standards across other application languages.

Without hybrid observability powered by AI, it’s stressful not having complete visibility into your application. Plus, it contributes to risky deployments. Yet we hear that many developers have poor visibility into what powers production code. Without transparency into their apps, developers cannot see:

How code connects to underlying infrastructure and cloud services.
What is causing errors within code.
What users are experiencing along the journey.

You can try navigating tickets, permissions, and dashboards that don’t tell the right story, but there are ways to solve this problem.

Here are some Hybrid Observability best practices your team should implement to improve visibility into your application’s code.

Step 1: Find the Right Monitoring Solution

Traditional application monitoring tools tend to present issues and performance from the app only – ignoring the underlying on-prem or cloud infrastructure resources and severely limiting developers visibility into troubleshooting when problems arise.

So, the first thing devs can do to achieve observability within your team is to confirm you’re using the right solution.

Here are a few simple questions to gauge whether or not your monitoring solution is hindering your ability to ship code with confidence:

Can you see the impact of your software releases within the monitoring tool?
Does your monitoring tool provide code-level visibility? Do you know if the error exists within your application, infrastructure, network, or storage?
Can your team use the tool on its own to be successful? Or does it require asking another team for permission to make changes?
Does the tool show how users are experiencing the issue? Do you measure against established baselines? Without this, it will be challenging to prioritize the right bugs.

A monitoring solution that works WITH your team, not against, is just one way to improve visibility into a product.

Step 2: Manage Alerts Before the User Experience Suffers

We understand that receiving too many alerts can be not only disruptive and time-consuming, but it takes your attention away from what’s really important – solving issues before it’s too late. Triaging alerts helps you become aware of the most critical issues for smooth operations and user experience.

In order to separate the business-critical alerts from the rest, you shouldn’t have to re-instrument your metrics within your monitoring solution. Make sure you can easily display metrics already being tracked. With this in mind, pick one or two metrics that your users deeply care about and focus alerts here. Here are a couple of examples:

95% latency is > 1s
Request failure rate is > 0.1%

Criteria like this let you know there is a serious impact on the user experience. Hook these alerts up to existing ticketing systems or page on-call teams to hop on mission-critical issues.

This proactive observability best practice will help minimize bad user experiences. After you know which metrics are worth monitoring, focus on which errors actually need attention.

Step 3: Understand Errors Before It’s Too Late

It’s easy to get overwhelmed with the urge to fix every error that may be impacting the user experience. You must understand the critical errors, but keep an eye on anomalies and minor errors before they bubble up.

Investing in an error-monitoring tool will help identify non-critical errors so your team can focus more time on addressing the priority issues and reducing the risk of a severe incident.

This is what observability is about – understanding your application from its internal signals to know how the system works and not a fantasy that looks great on dashboards.

Step 4: Follow the Dots for Faster Troubleshooting

When an incident occurs, you need to know where to start, especially if it spans different calls, systems, and microservices.

Here is how to get started at troubleshooting in a situation full of unknowns:

Lean into auto-instrumentation to record traces that capture real-time events across the application.
Follow the traces to underlying logs to better understand an issue.
See interaction patterns between microservices and application components.
Correlate traces to monitored infrastructure resources and network calls.

Satisfy these use cases with a centralized observability platform with these use cases in mind.

Achieve Hybrid Observability Powered by AI with LogicMonitor

At LogicMonitor, we expand what’s possible for businesses through monitoring and observability software. LogicMonitor seamlessly monitors everything from networks to applications to the cloud, empowering developers to focus less on problem-solving and more on innovation. Our cloud-based platform helps you see more, know more, and do more.

Modern software development is evolving rapidly, and while the latest innovations allow companies to grow through greater efficiency, there is a cost. Modern architectures are incredibly complex, which can make it challenging to diagnose and rectify performance issues.

Once these issues affect customer experience, the consequences can be costly.

So, what is the solution?

Observability — which provides a visible overview of the big picture. This is achieved through metrics, logs, and traces, with a core focus on distributed tracing.

Whether you are a DevOps engineer, software leader, CEO, or business leader, understanding the role that distributed tracing plays, and when to use it, can give you that competitive edge you seek.

What Is Distributed Tracing?
The History of Distributed Tracing
How Does Distributed Tracing Work?
What Does Distributed Tracing Mean Within Observability?
Why Does Distributed Tracing Matter?
Why Seek Distributed Tracing Solutions?
What Is Sampling?
Key Terms to Know for Distributed Tracing

What Is Distributed Tracing?

Distributed tracing is an emerging DevOps practice that makes it easier to debug, develop, and deploy systems.

Distributed tracing involves the operating and monitoring of modern application environments. As data moves from one service to another, distributed tracing is the capacity to track and observe service requests. The goal here is to understand the flow of requests so that you can pinpoint weak spots in the system, such as failures or performance issues.

This flow of requests is through a microservices environment or distributed architecture, which we will touch on momentarily.

As discussed, traces go hand-in-hand with two other essential types of data— metrics and logs. Together, these three types of telemetry data provide you with a complete picture of your software environment and more importantly, how it is performing.

The History of Distributed Tracing

Monolithic service architectures were once the golden standard, but are now becoming increasingly rare.

Monolithic applications have evolved into microservices. When it comes to a traditional monolithic application, that contains a centralized codebase in a single service, diagnosing a failure can be as simple as following a single stack trace. To troubleshoot an error, you would simply look through your log messages or implement standard APM tools. However, this type of architecture makes scaling software development tedious and painstaking.

As technology evolved, distributed architectures, such as microservices, were developed to provide better communication between different platforms and more user-friendly systems. An increase in efficiency and productivity resulted.

Since a distributed architecture or microservices-based application can consist of tens, hundreds, or even thousands of services running across different hosts, a deeper, more complex level of telemetry is required to make sense of relationships — relationships you may not even be aware of.

The benefits of developing apps using microservices are vast. This approach involves smaller services that can be more easily deployed, scaled, and updated. This provides greater flexibility in terms of the technologies and frameworks you use for each component.

Although many real-world environments still use a combination of monolith apps alongside microservices-based apps, there has been a dramatic shift. As of 2018, research shows that 63% of traditional enterprises were using microservices architecture, experiencing:

Improved employee efficiency.
Improved end-user experience.
Cost savings on both infrastructure and development tools.

This research was based on the views of engineering managers, software architects, and other applications development experts across 51 countries and twelve industries.

Another 2020 report found that 92% of organizations are experiencing success with microservices. This means that if you are not leveraging the benefits of microservices, you risk being left behind.

Today, modern software solutions are typically implemented as large-scale, complex distributed systems. For example, using the microservice architectural style. These modern applications are often developed by different teams and may even use different language programs. As these applications evolved, companies realized they needed a way to view the entire request flow — not just individual microservices in isolation.

Based on the research above, lack of visibility into the end-to-end process, across multiple microservices and communication between teams are some of the top challenges that companies using microservices expect to face. The ultimate goal is to handle errors throughout the process reliably and consistently.

This is where distributed tracing came into play, becoming a best practice for optimal visibility and control. Distributed tracing tackles common issues, such as difficulty tracking and analyzing requests in a distributed environment. Debugging these issues is also fairly straightforward thanks to tracing — even when systems are highly distributed.

At first, this concept was painfully time-consuming. Being able to collect and visualize data was incredibly labor-intensive. The number of resources that were being spent on tracing was taking away from the development of new features and the overall growth of a company. The development of tools was needed to properly support distributed architectures, leading to a growing reliance on distributed tracing. This includes data tracing tools.

Distributed tracing took the concept of tracing, which was used to provide the point of view of a single monolithic process, extending it to the point of view of all the processes involved in the journey of a single request.

Luckily, companies such as LogicMonitor began offering a fully automated approach. This made it possible for companies to implement solutions that enhance the productivity of tracing, analyzing, and visualizing available data. Being able to identify where issues occur in an application is a game-changer.

The Relationship Between OpenTelemetry and Distributed Tracing

In 2019, OpenTracing and OpenCensus merged into OpenTelemetry.

OpenTelemetry offers a single, open-source standard to capture and export logs, metrics, and traces from your infrastructure and cloud-native applications. Providing a specification in which all implementations should follow, including a common set of SDKs and APIs, OpenTelemetry can help with distributed tracing.

While distributed tracing has been around for over a decade, it was only recently that interoperability between distributed tracing systems became widely adopted. It is OpenTelemetry that created interoperability between OpenCensus and OpenTracing.

LogicMonitor’s Distributed Tracing is an OpenTelemetry-based integration that allows you to forward trace data from instrumented applications. This allows you to monitor end-to-end requests as they flow through distributed services. Learn more about installing an OpenTelemetry Collector.

How Does Distributed Tracing Work?

Think of distributed tracing as a tree-like structure. A root or “parent” span, branches off into “child” spans.

Once an end-user interacts with an application, tracing begins. When an initial request is sent, such as an HTTP, that request is given a unique trace ID. That trace will describe the entire journey of that single request. As the request moves through the system, operations or “spans” are tagged with that initial trace ID. It is also assigned a unique ID. There will also be an ID associated with the operation that originally generated the current request — also known as the “parent” span.

Traces shown in the LogicMonitor platform

Each of the spans represents a single step within the request’s journey and is encoded with critical data, including everything from tags to query and detailed stack traces to logs and events that provide context. This means that as a trace moves through a distributed system, the platform will generate child spans for each additional operation needed along the way.

By using tools, such as LogicMonitor’s Distributed Tracing, you can better visualize data, monitoring end-to-end requests.

There is a multi-step process involved, which includes:

Instrumentation
Trace context
Metrics and metadata
Analysis and visualization

Based on this process, distributed tracing can offer real-time visibility into an end user’s experience.

In summary:

Once a trace is started, it is assigned a trace ID that follows the request wherever it goes.
A new span is created following each chunk of work in the request that contains the same trace ID, a new trace ID, and the parent span ID.
Once a parent span ID is created, a parent-child relationship is formed between spans. A child span will have one parent span, while a parent can have multiple child spans.
Spans are also time-stamped so that they can be placed in a timeline.
One span in a trace tree will not have a parent, which is referred to as the root span.

What Does Distributed Tracing Mean Within Observability?

As the use of microservices and cloud-based applications increases, the need for observability is more important than ever before. This is accomplished by recording the system data in various forms.

Above, we discussed metrics, logs, and traces. Together, these are often referred to as the pillars of observability. When understood, these strategies allow you to build and manage better systems.

Tracing is one critical component within this overall strategy, representing a series of distributed events through a distributed system. Traces are a representation of logs, providing visibility on the path a request travels, as well as the structure of a request.

Tracing is the continuous monitoring of an application’s flow, tracking a single user’s journey through an app stack. Distributed request tracing has evolved from the method of observability, ensuring cloud applications are in good health. Distributed tracing involves the process of following a request by recording the data associated with the path of microservices architecture. This approach provides a well-structured format of data tracing that is leveraged across various industries, helping DevOps teams quickly understand the technical glitches that disrupt a system infrastructure.

Again, this relates to the use of tools such as OpenTelemetry.

Why Does Distributed Tracing Matter?

The development of new technologies and practices allows businesses to grow more rapidly than ever before. However, as variables such as microservices, containers, DevOps, serverless functions, and the cloud gain velocity, making it easier to move from code to production, this creates new challenges.

The more complex software is, the more potential points of failure there will be within an application stack, leading to an increased mean time to repair (MTTR) — which is the average time it takes from when an issue is reported until that issue is resolved. As complexity increases, there is also less time to innovate because more time and energy is spent on diagnosing issues.

Making distributed tracing part of your end-to-end observability strategy, software teams can operate much more effectively. In the process, teams can:

Better identify and repair issues to improve customer experience and business outcomes.
More effectively measure system health to continuously make positive changes to the customer experience.
Innovate to rise above competitors.

Why Seek Distributed Tracing Solutions?

Bringing coherence to distributed systems is the primary benefit of distributed tracing. However, this leads to a list of other benefits, directly affecting a company’s bottom line.

Increased productivity — Compared to monolithic applications, tracking and resolving issues associated with a microservice architecture can be time-consuming and expensive. Failure data also requires developers to decipher issues from messages that aren’t always clear. Distributed tracing reduces the diagnostic and debugging process through a more accurate, holistic view. This significantly boosts efficiency.
Implementation is flexible — Since distributed tools work across a variety of programming languages and applications, developers can implement these tools into practically any microservices system, viewing critical data through one tracing application.
Improved collaboration — A specialized team develops each process within a microservice environment. When there is an error, it’s challenging to determine where that error occurred and who is responsible. By eliminating data silos and the issues they create, cross-team collaboration improves, as does response time.
Reduces MTTR — If a customer reports a failure, distributed traces can be used to verify that it is a backend issue. Using the available traces, engineers can then quickly troubleshoot. Frontend performance issues could also be analyzed using a distributed tracing tool.

So, the reasons why you would seek distributed tracing solutions are clear — but how?

Even if there is only a handful of services, the amount of data can become overwhelming, and fast. Sifting through traces is much like finding a needle in a haystack. To properly observe, analyze, and understand every trace in real-time, you need a distributed tracing tool. This will highlight the most useful information in terms of where you should take action.

What Is Sampling?

Sampling comes into play based on the sheer volume of trace data that collects over time. As more microservices are deployed and the volume of requests increases, so does the complexity and cost of storing/transmitting that data. Instead of saving all the data, organizations can store samples of data for analysis.

There are two approaches to sampling distributed traces.

Head-Based Sampling

When processing large amounts of data, distributed tracing solutions often use head-based sampling. This involves randomly selecting a trace to sample before it has finished its path. This means that sampling begins when a single trace is initiated. That data will either be kept or discarded and is the preferred form of sampling based on its simplicity.

There are many advantages to head-based sampling, as it is fast and simple, has little-to-no effect on application performance, and is ideal for blended monolith and microservices environments. The disadvantage is that since data is selected randomly, valuable information may be lost because there is no related trace data or context.

Tail-Based Sampling

Based on the limitations of head-based sampling, tail-based sampling is best for high-volume distributed systems where every error must be observed. In this case, 100% of traces are analyzed, with sampling taking place after traces have fully completed their path. This allows you to pinpoint exactly where issues are, solving the whole needle in a haystack concern.

Tail-based sampling leads to more informed decisions, especially for latency measurements. Since a sampling decision is made at the end of the workflow, it is easier to visualize errors and performance issues. The downside to this option is that it can cost more to transmit and store large amounts of data and additional gateways and proxies will be required.

Key Terms to Know for Distributed Tracing

Here is a short glossary of distributed tracing terminology:

A request is how applications, microservices, and functions communicate with one another.
A span is the segments that make up a portion of a trace.
A root span is the first span in a trace, which is also known as the parent span.
A child span is what follows the root or parent span.
A trace is the performance data that flows through microservices.
Sampling involves storing samples of tracing data instead of saving them for analysis.
Head-based sampling involves the decision to collect and store trace data randomly while the root span is being processed.

Distributed tracing tells the story of what has happened in your systems, helping you quickly deal with unpredictable problems. As the future of technology and software becomes increasingly complex, the use of distributed tracing, as well as techniques like monitoring metrics and logging, will become increasingly essential. It’s time to rise to the challenge.

In this blog series, we are covering application instrumentation steps for distributed tracing with OpenTelemetry standards across multiple languages. Earlier, we covered Java Application Manual Instrumentation for Distributed Traces, Golang Application Instrumentation for Distributed Traces, and DotNet Application Instrumentation for Distributed Traces. Here we are going to cover the instrumentation for NodeJS.

Initialize the New Project

As all NodeJS projects start with “npm init”, we will do the same for our project. But before that, we will create a separate directory.

Go ahead, make a directory for the project and change it to that directory.

mkdir "InstrumentationProject"
cd "InstrumentationProject"

Now that we have our directory, let’s initialize our nodeJS project.

npm init esm -y

This will initiate the project with ES-6 modules and create package.json. Next, let’s grab our dependencies.

npm i @opentelemetry/sdk-trace-base @opentelemetry/api @opentelemetry/resources
@opentelemetry/exporter-collector
@opentelemetry/semantic-conventions

At this point, you will have all of the dependencies installed. Create a file named “tracer.js”. Your folder structure should look like this:

You want your tracer.js to be run before anything else in your application so that all the required components for tracing are configured before an application starts taking requests.

To achieve this, edit index.js and add before module.exports.

require('./tracer')

Now we are ready to start initializing our Tracer Provider.

Initializing Tracer Provider

In this section, we will be editing tracer.js. Here are the required imports:

import { BasicTracerProvider, ConsoleSpanExporter, SimpleSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
import { CollectorTraceExporter } from "@opentelemetry/exporter-collector";
import { Resource } from "@opentelemetry/resources";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";

Let’s get an instance of BasicTracerProvider:

const provider = new BasicTracerProvider();

Set Resource Attributes

The resource attributes are used to describe the resource that is generating the telemetry data.

For example:

Name of the service
Deployment zone
Cloud provider
Type of host (VM, Container, Kubernetes)

Let us define a few resource attributes:

const resource = new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'authentication-service',
    [SemanticResourceAttributes.SERVICE_NAMESPACE]: 'US-WEST-1',
    'host.type': 'VM'
})

You can describe the resource attributes using the predefined keys in SemanticResourceAttributes or you can use custom attribute keys.

Go ahead and configure our newly defined resource attributes with the Tracer:

provider.resource = resource

Configuring Trace/Span Exporter

The exporter is a component that sends the collected telemetry data to a remote backend, locally running collector, or to a file. We can also export the collected data to the console, as we will see later.

Let’s initialize the exporters we will be using in this example:

const consoleSpanExporter = new ConsoleSpanExporter
const collectorTraceExporter = new CollectorTraceExporter

consoleSpanExporter: This is used to see the spans on the console.

collectorTraceExporter: This is used to export the traces in OpenTelemetry format.

Adding the Span Processor and Registering the Tracer

provider.addSpanProcessor(new SimpleSpanProcessor(consoleSpanExporter))
provider.addSpanProcessor(new SimpleSpanProcessor(collectorTraceExporter))
provider.register()

We are using SimpleSpanProcessor. You can also use BatchSpanProcessor, which sends the spans in batches for efficient use of system resources. You can customize the batch size.

Finally, we register our Tracer so that Opentelemetry APIs can use this tracer.

Instrumenting the Application

In this section, we will get to know the various aspects of instrumenting an application. You can find the code of the instrumented application towards the end.

A span is a single unit of work and traces are often made up of several spans. To enrich the span with more information about the operation, we leverage Span Attributes.

Creating a New Span

Here, ‘parent’ is the name that we want to give to the span.

const parentSpan = opentelemetry.trace.getTracer('default').startSpan('parent');

Adding Attributes to the Span

You can add any number of attributes to the span.

parentSpan.setAttribute("microservice","server")
parentSpan.setAttribute("prodEnv","true")
childFunction(parentSpan) //Passing the context the function
parentSpan.end() //DO NOT forget to end the span

Creating the child span using parent span in the childFunction:

function childFunction(parentSpan) {

const ctx = opentelemetry.trace.setSpan(opentelemetry.context.active(), parentSpan);
const childSpan = opentelemetry.trace.getTracer('default').startSpan('child', undefined, ctx);
.
.
.
.
childSpan.end()
}

Setting a Span’s Status

By default, the status is UNSET. If you have encountered an error you can set the error status as:

childSpan.setStatus({
            code: SpanStatusCode.ERROR,
            message: 'Authentication failed.'
          })

Instrumented Application Code

index.js

require = require("esm")(module/* , options */)
require('./tracer')
module.exports = require("./main.js")

tracer.js

import { BasicTracerProvider, ConsoleSpanExporter, SimpleSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { CollectorTraceExporter } from "@opentelemetry/exporter-collector";
import { Resource } from "@opentelemetry/resources";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";

const provider = new BasicTracerProvider();


const resource = new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'authentication-service',
    [SemanticResourceAttributes.SERVICE_NAMESPACE]: 'US-WEST-1',
    'host.type': 'VM'
})


const consoleSpanExporter = new ConsoleSpanExporter
const collectorTraceExporter = new CollectorTraceExporter

provider.resource = resource

provider.addSpanProcessor(new SimpleSpanProcessor(consoleSpanExporter))
provider.addSpanProcessor(new SimpleSpanProcessor(collectorTraceExporter))
provider.register()

main.js

import Auth from './auth'
import opentelemetry from "@opentelemetry/api"
console.log("Hello")

const server = async (username, password) => {
    
    const parentSpan = opentelemetry.trace.getTracer('default').startSpan('parent');
    
    parentSpan.addEvent("Parent Span")
    parentSpan.setAttribute("microservice","server")
    parentSpan.setAttribute("prodEnv","true")
    parentSpan.addEvent("Received request")
    console.log("Got request")
    await Auth(parentSpan, username,password)

    console.log("SERVER END")
    parentSpan.end()
};

server("user1","password");
server("wrongUser","wrongPassword");

export {}

auth.js

import opentelemetry, { SpanStatusCode } from "@opentelemetry/api"

export default async function(parentSpan, username,password) {
    const ctx = opentelemetry.trace.setSpan(opentelemetry.context.active(), parentSpan);
    const childSpan = opentelemetry.trace.getTracer('default').startSpan('child', undefined, ctx);
    childSpan.setAttribute("user",username)
    childSpan.setAttribute("microservice","DB-Service")
    
    //Simulate a network delay
    await sleep(Math.floor(
        Math.random() * (500 - 200) + 200
      ))
    try{
        if (username==="user1" && password==="password") {
                
            //Success span
            console.log("Authenticated.")
            childSpan.addEvent("Authentication Failed", { authentication: "Successful" }  )

        } else {
            
            throw("Authentication Failed Exception")
            //Error Span

        }
    }
    catch(error) {
        console.log("Failed to Authenticate.")
        childSpan.recordException(error)
        childSpan.setStatus({
            code: SpanStatusCode.ERROR,
            message: 'Authentication failed.'
          })
    }
    finally {
        childSpan.end()
    }
}


function sleep(ms) {
  return new Promise((resolve) => {
    setTimeout(resolve, ms);
  });
}

Running the Application

From your project directory:

node index.js

Traces Received in LogicMonitor Platform

Detailed View of Trace

Parent Span:

Child Span:

Conclusion

Congratulations, you have just written a NodeJS application emitting traces using the OpenTelemetry Protocol (OTLP) Specification. Feel free to use this code as a reference when you get started with instrumenting your business application with OTLP specifications. LogicMonitor APM specification is 100% OTLP compliant with no vendor lock-in. To receive and visualize traces of multiple services for troubleshooting with the LogicMonitor platform, sign up for a free trial account here. Check back for more blogs covering application instrumentation steps for distributed tracing with OpenTelemetry standards across multiple languages.

Manual instrumentation provides enhanced insight into the operations of distributed systems. By instrumenting your Java applications manually, you gain greater control over the data you collect, leading to improved visibility across your distributed architecture.

In this blog series, we are covering application instrumentation steps for distributed tracing with OpenTelemetry standards across multiple languages. Earlier, we covered Golang Application Instrumentation for Distributed Traces and DotNet Application Instrumentation for Distributed Traces. Here we are going to cover the instrumentation for Java.

Exploring OpenTelemetry concepts

OpenTelemetry is a set of libraries, APIs, agents, and tools designed to capture, process, and export telemetry data—specifically traces, logs, and metrics—from distributed systems. It’s vendor-neutral and open-source, which means your business has interoperability and freedom of choice to implement observability systems across a wide range of services and technologies.

You can break OpenTelemetry down into a few main concepts: signals, APIs, context and propagation, and resources and semantic conventions.

Signals

Signals in OpenTelemetry are traces, metrics, and logs. Traces represent the end-to-end latency in your operation across services. They are composed of spans, which are named individual units of work with start and end timestamps and contextual attributes.

Metrics are the qualitative measurements over time (CPU usage, memory usage, disc usage) that help you understand the overall performance of your application. Logs, on the other hand, are records of events that occur on systems that provide insights into errors and other events.

APIs

OpenTelemetry defines a language-agnostic API that helps teams create code that implements the API to collect and process data and export it to their chosen backends. The API allows anyone to collect the same data, whether using custom software or an out-of-the-box monitoring solution, allowing them to process data on their own terms and tailor a monitoring solution based on their needs.

Context and propagation

Context is a concept used to share data (like span context) between code and networks. Context propagation ensures that distributed traces stay connected as requests travel across networks through different services—helping teams get a holistic view across the entire infrastructure.

Resources and semantic conventions

A resource is what provides information about the entity producing data. It contains information like the host name, device environment, and host details. Semantic conventions are the standardized attributes and naming conventions that make telemetry data more consistent and allow any environment to uniformly interpret the data without worrying about variations in data output.

Understanding these concepts will help you decipher telemetry output and get started with your OpenTelemetry projects. So, let’s start by setting up a new project.

Custom instrumentation and attributes

Custom instrumentation in Java applications allows developers to capture more granular telemetry data beyond what automatic instrumentation provides. By manually defining spans and adding attributes, teams can gain deeper insights into specific application behaviors and business logic within a distributed system.

Adding attributes to spans

Attributes are key-value pairs attached to spans, providing contextual metadata about an operation. These attributes can include details such as user IDs, transaction types, HTTP request details, or database queries. By adding relevant attributes, developers can enhance traceability, making it easier to filter and analyze performance data based on meaningful application-specific insights.

Creating Multi-Span Attributes

Multi-span attributes allow developers to maintain consistency across spans by propagating key metadata across multiple operations. This is especially useful when tracking a request across services, ensuring that relevant information, such as correlation IDs or session details, remains linked throughout the trace.

Initialize New Project

To begin, create a new Java project and add the below dependencies that are required for OpenTelemetry manual instrumentation.

Maven

<project>
  <dependencyManagement>
    <dependencies>
      <dependency>
        <groupId>io.opentelemetry</groupId>
        <artifactId>opentelemetry-bom</artifactId>
        <version>1.2.0</version>
        <type>pom</type>
        <scope>import</scope>
      </dependency>
    </dependencies>
  </dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>io.opentelemetry</groupId>
      <artifactId>opentelemetry-api</artifactId>
    </dependency>
    <dependency>
      <groupId>io.opentelemetry</groupId>
      <artifactId>opentelemetry-sdk</artifactId>
    </dependency>
    <dependency>
      <groupId>io.opentelemetry</groupId>
      <artifactId>opentelemetry-exporter-otlp</artifactId>
    </dependency>
    <dependency>
      <groupId>io.opentelemetry</groupId>
      <artifactId>opentelemetry-semconv</artifactId>
      <version>1.5.0-alpha</version>
    </dependency>
    <dependency>
      <groupId>io.grpc</groupId>
      <artifactId>grpc-netty-shaded</artifactId>
      <version>1.39.0</version>
    </dependency>
  </dependencies>
</project>

Gradle

dependencies {
  implementation platform("io.opentelemetry:opentelemetry-bom:1.2.0")
  implementation('io.opentelemetry:opentelemetry-api')
  implementation('io.opentelemetry:opentelemetry-sdk')
  implementation('io.opentelemetry:opentelemetry-exporter-otlp')
  implementation('io.opentelemetry:opentelemetry-semconv:1.5.0-alpha')
  implementation('io.grpc:grpc-netty-shaded:1.39.0')
}

It is recommended to use OpenTelemetry BOM to keep the version of the various components in sync.

If you are developing a library that is going to be used by some other final application, then your code will have dependency only on opentelemetry-api.

Create Resource Detectors

The resource describes the object that generated the Telemetry signals. Essentially, it must be the name of the service or application. OpenTelemetry has defined the standards to describe the service execution env, viz. hostname, hostType (cloud, container, serverless), namespace, cloud-resource-id, etc. These attributes are defined under Resource Semantic Conventions or semconv.

Here we will be creating a resource with some environmental attributes.

Attribute	Description	Required
service.name	It is the logical name of the service.	Yes
service.namespace	It is used to group the services.For example, you can use service.namespace to distinguish services across environments like QA,UAT,PROD.	No
host.name	Name of the host where the service is running.	No

//Create Resource
AttributesBuilder attrBuilders = Attributes.builder()
   .put(ResourceAttributes.SERVICE_NAME, SERVICE_NAME)
   .put(ResourceAttributes.SERVICE_NAMESPACE, "US-West-1")
   .put(ResourceAttributes.HOST_NAME, "prodsvc.us-west-1.example.com");
 
Resource serviceResource = Resource
   .create(attrBuilders.build());

Init Span Exporter

The exporter is the component in SDK responsible for exporting the Telemetry signal (trace) out of the application to a remote backend, log to a file, stream to stdout., etc.

Consider how distributed tracing impacts system performance. Proper trace sampling can help balance the need for detailed traces with overall system efficiency, preventing performance slowdowns or data overload.

In this example, we are creating a gRPC exporter to send out traces to an OTLP receiver backend running on localhost:55680. Possibly an OTEL Collector.

//Create Span Exporter
OtlpGrpcSpanExporter spanExporter = OtlpGrpcSpanExporter.builder()
   .setEndpoint("http://localhost:55680")
   .build();

Construct TracerProvider and Configure SDK

Using TracerProvider you can access Tracer, a key component in Java performance monitoring, that is used to create spans and track performance metrics.

//Create SdkTracerProvider
SdkTracerProvider sdkTracerProvider = SdkTracerProvider.builder()
   .addSpanProcessor(BatchSpanProcessor.builder(spanExporter)
       .setScheduleDelay(100, TimeUnit.MILLISECONDS).build())
   .setResource(serviceResource)
   .build();
 
//This Instance can be used to get tracer if it is not configured as global
OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
   .setTracerProvider(sdkTracerProvider)
   .buildAndRegisterGlobal();

You need to configure the SDK and create the tracer as a first step in your application.

With the right configuration in place, developers can monitor their application’s performance in real-time. This enables quick adjustments and optimization, allowing you to address issues or enhance performance as soon as they arise.

Create Tracer

Tracer tracer= GlobalOpenTelemetry.getTracer("auth-Service-instrumentation");
//Tracer tracer= GlobalOpenTelemetry.getTracer("auth-Service-instrumentation","1.0.0");
 
//OR use the OpenTelemetry instance from previous step to get tracer
//openTelemetry.getTracer("auth-Service-instrumentation");

You can use GlobalOpenTelemetry only If your OpenTelemery instance is registered as global in the previous step or else you can use the OpenTelemetry instance returned by SDK builder.

The getTracer method requires an instrumentation library name as a parameter, which must not be null.

Using GlobalOpenTelemetry is essential for tracing intricate processes across multiple services. By enabling this, you streamline the tracing of multi-step workflows and boost overall operational efficiency, ensuring smooth and optimized system performance.

Creating and managing spans

Creating and managing spans efficiently is the next step after setting up your OpenTelemetry instrumentation. Properly defining, structuring, and annoying spans will help you understand how your operations flow through your system and help when troubleshooting problems.

A few things help make good spans: span attributes, child spans, and events.

Span attributes: Span attributes help assign meaning to your spans. They distinguish one operation from another and provide valuable metadata for downstream analysis tools. Use attributes to represent business priorities, environmental details, and user information. Remember to standardize attributes across environments using semantic conventions to ensure service consistency.
Child spans: More complex workflows require multiple steps and dependencies, which are hard to represent in a single span. Child spans allow you to break a single operation into sub-operations, making it easier to find delays and errors. Use these to create parent-child relationships, giving you a structured view of your data for faster troubleshooting.
Event logging: Events and logs allow you to record time-stamped data points or device internal state changes. Embedding events and logs means you don’t rely solely on external logging solutions and ensures all contextual information is tied directly to the specific operation within your trace data. This data will prove invaluable when diagnosing problems since it provides immediate context to performance and anomalies.

There are also a few best practices to consider to get the most out of your telemetry, some of which include:

Implement status and error handling (StatusCode.OK, StatusCode.ERROR) on spans to make it easier to identify problematic spans quickly
Since not all requests require full instrumentation, optimize your sampling strategy to balance performance and observability
Consider linking spans for spans that share a common relationship but aren’t necessarily a parent/child relationship to help correlate related but independently triggered spans
Make use of OpenTelemetry’s Context and Scope management utilities to ensure spans are accessible in multi-threaded workflows where spans may not correctly propagate

Understanding these fundamentals will help your organization optimize your instrumentation to produce more meaningful telemetry. With that, let’s look at some examples of how to create and manage your spans effectively.

Troubleshooting common issues

Even with well-structured spans, OpenTelemetry instrumentation can sometimes present challenges. Some common troubleshooting techniques include:

Ensuring proper span propagation: When spans don’t appear in the expected traces, verify that context propagation is correctly implemented across service boundaries.
Checking exporter configurations: If traces are missing from your backend, confirm that your exporter settings are correctly configured, and ensure the application has network access to the telemetry endpoint.
Managing high latency in trace data: If traces are delayed or missing, consider adjusting your sampling rate to balance performance and data volume.
Handling incomplete spans: Ensure that spans are properly ended, especially in multi-threaded or asynchronous workflows, where spans may be lost due to improper scope management.

Alternative protocols for telemetry data transmission

By default, OpenTelemetry uses gRPC for exporting telemetry data. However, in some cases, HTTP-based transport methods can be a better alternative, especially when working with legacy systems, firewalls, or monitoring tools that do not support gRPC.

Create a Span and Define Span Attributes

The span is a single execution of an operation. It is identified by a set of attributes, which are sometimes referred to as span tags. Application owners are free to choose the attributes that can capture the required information for the spans. There is no limit to the number of span attributes per span.

In this example, we are defining two-span attributes for our sample applications.

Span parentSpan = tracer.spanBuilder("doLogin").startSpan();
parentSpan.setAttribute("priority", "business.priority");
parentSpan.setAttribute("prodEnv", true);

Create a Child Span

You can use the setParent method to correlate spans manually.

Span childSpan = tracer.spanBuilder("child")
   .setParent(Context.current().with(parentSpan))
   .startSpan();

The OpenTelemetry API also offers an automated way to propagate the parent span on the current thread.

Use the makeCurrent method to automatically propagate the parent span on the current thread.

try (Scope scope = parentSpan.makeCurrent()) {
   Thread.sleep(200);
   boolean isValid=isValidAuth(username,password);
   //Do login
 
} catch (Throwable t) {
   parentSpan.setStatus(StatusCode.ERROR, "Change it to your error message");
} finally {
   parentSpan
       .end(); // closing the scope does not end the span, this has to be done manually
}
 
//Child Method
private boolean isValidAuth(String username,String password){
 
   Span childSpan = tracer.spanBuilder("isValidAuth").startSpan();
   // NOTE: setParent(...) is not required;
   // `Span.current()` is automatically added as the parent
   childSpan.setAttribute("Username", username)
       .setAttribute("id", 101);
   //Auth code goes here
   try {
       Thread.sleep(200);
       childSpan.setStatus(StatusCode.OK);
   } catch (InterruptedException e) {
       childSpan.setStatus(StatusCode.ERROR, "Change it to your error message");
   }finally {
       childSpan.end();
   }
   return true;
}

Add Events/Logs to Spans

Spans can be enriched with some execution logs/events that happened during the execution of the span. This information will help provide contextual logs always tied up with the respective span.

Attributes eventAttributes = Attributes.builder().put("Username", username)
   .put("id", 101).build();
childSpan.addEvent("User Logged In", eventAttributes);

Putting It Together

TestApplication.java

package com.logicmonitor.example;
 
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.common.AttributesBuilder;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import io.opentelemetry.semconv.resource.attributes.ResourceAttributes;
import java.util.concurrent.TimeUnit;
 
public class TestApplication {
 
   private static final String SERVICE_NAME = "Authentication-Service";
   static {
       //Create Resource
       AttributesBuilder attrBuilders = Attributes.builder()
           .put(ResourceAttributes.SERVICE_NAME, SERVICE_NAME)
           .put(ResourceAttributes.SERVICE_NAMESPACE, "US-West-1")
           .put(ResourceAttributes.HOST_NAME, "prodsvc.us-west-1.example.com");
 
       Resource serviceResource = Resource
           .create(attrBuilders.build());
       //Create Span Exporter
       OtlpGrpcSpanExporter spanExporter = OtlpGrpcSpanExporter.builder()
           .setEndpoint("http://localhost:55680")
           .build();
 
       //Create SdkTracerProvider
       SdkTracerProvider sdkTracerProvider = SdkTracerProvider.builder()
           .addSpanProcessor(BatchSpanProcessor.builder(spanExporter)
               .setScheduleDelay(100, TimeUnit.MILLISECONDS).build())
           .setResource(serviceResource)
           .build();
 
       //This Instance can be used to get tracer if it is not configured as global
       OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
           .setTracerProvider(sdkTracerProvider)
           .buildAndRegisterGlobal();
   }
   public static void main(String[] args) throws InterruptedException {
       Auth auth = new Auth();
       auth.doLogin("testUserName", "testPassword");
       Thread.sleep(1000);
   }
}

Auth.Java

package com.logicmonitor.example;
 
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;
 
public class Auth {
 
   Tracer tracer = GlobalOpenTelemetry.getTracer("auth-Service-instrumentation");
 
   //Tracer tracer= GlobalOpenTelemetry.getTracer("auth-Service-instrumentation","1.0.0");
   public void doLogin(String username, String password) {
       Span parentSpan = tracer.spanBuilder("doLogin").startSpan();
       parentSpan.setAttribute("priority", "business.priority");
       parentSpan.setAttribute("prodEnv", true);
 
       try (Scope scope = parentSpan.makeCurrent()) {
           Thread.sleep(200);
           boolean isValid = isValidAuth(username, password);
           //Do login
 
       } catch (Throwable t) {
           parentSpan.setStatus(StatusCode.ERROR, "Change it to your error message");
       } finally {
           parentSpan
               .end(); // closing the scope does not end the span, this has to be done manually
       }
 
   }
 
   private boolean isValidAuth(String username, String password) {
 
       Span childSpan = tracer.spanBuilder("isValidAuth").startSpan();
       // NOTE: setParent(...) is not required;
       // `Span.current()` is automatically added as the parent
 
       //Auth code goes here
 
       try {
           Thread.sleep(200);
           childSpan.setStatus(StatusCode.OK);
           Attributes eventAttributes = Attributes.builder().put("Username", username)
               .put("id", 101).build();
           childSpan.addEvent("User Logged In", eventAttributes);
       } catch (InterruptedException e) {
           childSpan.setStatus(StatusCode.ERROR, "Change it to your error message");
       } finally {
           childSpan.end();
       }
       return true;
   }
}

Run the Application

Run TestApplication.java.

Traces Received in the LogicMonitor Platform

Detailed View of the Trace

Parent Span:

Child Span:

Conclusion

Congratulations, you have just written a Java application emitting traces using the OpenTelemetry Protocol (OTLP) Specification. Feel free to use this code as a reference when you get started with instrumenting your business application with OTLP specifications. LogicMonitor APM specification is 100% OTLP compliant with no vendor lock-in. To receive and visualize traces of multiple services for troubleshooting with the LogicMonitor platform, sign up for a free trial account here. Check back for more blogs covering application instrumentation steps for distributed tracing with OpenTelemetry standards across multiple languages.

Distributed tracing plays a crucial role in maintaining system stability and minimizing service disruptions. By monitoring traces across various components, you can ensure more reliable operation and higher uptime, even in complex environments. Unlock the full potential of distributed tracing with LogicMonitor’s powerful monitoring platform.

The pandemic has created a unique set of circumstances that have accelerated what was already a growing trend. The shift from brick and mortar retail to a hybrid online and in-person retail experience has meant that nearly every retailer must also be an e-tailer and deliver a flawless digital shopping experience for its customers.

The rise of omnipotent global retail brands like Amazon who have defined the shopping experience but also set unrealistic expectations of how goods can be fulfilled has set the bar very high for other retailers. A digital business relies on smooth and responsive technology. A slow response from a site or an app can directly lead to lost customers. Outages of your website or your fulfillment systems can lead to lost sales and delays which impact your reputation.

Thankfully, technology is keeping pace with the expectations of the consumers with the reliance on things like Kubernetes to allow for agile development of systems and chatbots to handle costly first-line customer support. In this blog post, I will explore some of the challenges in the retail space and the trends in IT which are helping to address the challenges for retailers in this increasingly digital-first industry segment.

Trends in Retail Technology

Digital-First: Further Pivoting to E-commerce, Accelerated by COVID

COVID has accelerated what was an already significant trend as shoppers were unable to visit shops in 2020. Retailers must be increasingly innovative to keep customers in a very competitive marketplace dominated by Amazon and larger retailers. Having a strong online presence and mobile application will be fundamental to the success of the modern retailer.

AI-Driven Operations and Inventory Management

Just in time is no longer enough. Using AI to predict trends is becoming popular. Being able to predict seasonality or the latest fad is key to maintaining adequate stock levels. Using historical data to predict trends is the starting point but also learning from other trends to predict demand for new products or services is more challenging. Customer experience is already being dominated by the chatbot. Businesses will try to further leverage AI in all areas of business management from product design to marketing copy.

Additional Digital Services To Improve Customer Experience; Mobile Applications, Shared Marketplaces, and Payment Services

The world is now mobile and the vast majority of research and increasingly purchase decisions happen through a mobile device. Retailers have had to adapt already but are increasingly having to find new ways to engage customers. User experience is everything. Slow or buggy applications or online services could cost a retailer a customer. Finding better or novel ways for customers to check out, search your catalog, and be inspired by what other people are looking for will enhance the user experience and help retain customers. To compete with global corporations like Amazon, businesses are collaborating to provide their goods on shared marketplaces, improving customer experience and choice.

Net Zero Retail: Decarbonizing the Entire Supply Chain

Understanding the carbon footprint of scope three emissions (third-party) in a complex supply chain can be challenging. However, indirect emissions are still important to consider. Identifying energy efficiency opportunities is especially important for larger organizations that have a public mandate to be responsible for the way their corporation behaves. Retailers and e-tailers are focused on building a low carbon strategy and supplier engagement. Measuring your footprint is the first step. If you can measure it, you can make changes to lower your carbon emissions

Common Challenges Faced by Retail Organizations

How Do You Maintain Customer Experience in a Mobile World?

There is a myriad of different factors affecting a customer’s experience and their engagement with your online brand (website/app load/response times, checkout process, navigation and search, customer reviews, social media). Slight changes to load time or checkout experience can lose fickle customers quickly. The technology stack to deliver websites, applications, and in-store IT is pretty diverse and distributed. Continuous integration and continuous delivery (CI/CD) is a feature of a modern online business as companies test new features and functionality. Being able to add features, test spontaneously and then remove them is the Amazon way and is essential to being able to release new features which help and don’t hinder your customers.

How Do You Ensure Adequate Supply With Increasingly Complex Supply Chains?

The supply chain process is made more complex by a variety of delivery options (next day/same day), especially with perishable goods. Customer research online ahead of time leads to fierce price competition, which eats away margins. The number of interdependent factors associated with supply chain has a big impact on complexity (customer demand, seasonality, marketing, cash flow, different requirements of stakeholders in the supply chain).

Can You Be ‘Just in Time’ in a Dynamic Market?

Consumers have gotten used to being able to find what they need quickly. If you cannot supply what is needed, then your consumer will often look elsewhere. Forecasting effectively requires good data and experience and is key to making sure you can fulfill expected demand without sinking much money into stock. Just in time stock control is made easier by leveraging big data to analyze supply and demand and create forecasting models to predict both accurately.

The LogicMonitor Solution for the Retail Industry

From Logistics to E-commerce, Using Data To Drive Decision Making

Global Ecommerce/Logistics dashboard in LogicMonitor

LogicMonitor marries visibility into your traditional and cloud workloads with your IoT and productions systems to allow you to troubleshoot and optimize the system as a whole – from stock control to revenue data, all of this can be detailed in one system allowing the business to be more agile and make informed, data-based decisions.

Extensive Breadth of Coverage Across On-Premises Into Cloud and Containers

LogicMonitor integrations showing different network, computing, storage, cloud, container, app, logs and Cisco integrations

LogicMonitor has extensive coverage across the enterprise IT landscape and into the cloud with monitoring templates for everything from SD-WAN to containers. Often businesses are used to data silos, with teams needing their own specialist tools to monitor their specific technologies. LogicMonitor liberates silos by allowing every team to troubleshoot their problems in the same way with granular performance data for all types of infrastructure.

Easy Customization and Simplified Extensibility

LogicMonitor was designed by engineers for engineers and ease of customization is at the heart of the platform. The LM Exchange provides the perfect mechanism for the LogicMonitor team and our customers to share new monitoring templates. LogicMonitor simplifies the process of building custom monitoring templates with our rapid prototyping capabilities, allowing engineers to choose from a multitude of different mechanisms to extract monitoring metrics including our new ability to ingest metrics pushed from a monitored device. You don’t need to worry about building your own data sources if you don’t want to, we also have a large team of professional services engineers available to help users customize their monitoring based on needs.

LogicModules showing SSL, VMWare and Kubernetes

A Secure Platform

LogicMonitor’s platform is secure. The following are just some of the ways LogicMonitor ensures user and systems security.

Secure Architecture

RBAC, 2FA.
Encryption of data in transit and at rest.

Secure Data Collection

Only outbound communication is allowed from the LM Collector.
Data encrypted with TLS.
LM Collectors are securely locked to your environment.

Secure Operations

Collectors based on hardened Linux with perimeter and host-based IPS.
Operated out of top-tier DCs and AWS regions.
All with top security measures in place.

Secure Practices

Minimal personal data stored.
Device access credentials stored in memory and never written to disk.
Salted one-way hashes used in place of user passwords.

Secure Standards

Constant penetration testing ensures maximum security.
SOC2 validates our controls for security.
High availability and confidentiality.

In addition to LogicMonitor’s native platform, by leveraging a secure proxy, user operations teams will retain complete control of communications. This allows users to lock down traffic out of their networks to minimize exposure to external bad actors.

Identify Patterns and Anomalies With AIOps

Especially in the retail space, being able to analyze trends in demand can be the key to ensuring you have the right stock in place to fulfill customer needs. LogicMonitor uses industry-leading algorithms to analyze any metric ingested. Using LogicMonitor’s bespoke monitoring templates allows you to ingest revenue and stock data to identify patterns and anomalies.

LogicMonitor’s AIOps features intelligently detect service-impacting signals from noise, make signals more actionable, and reduce the flurry of false alerts that keep teams up at night. With alert escalation chains, users can ensure the right team members are informed via SMS, email, chat, or ITSM integrations. LogicMonitor’s AIOps capabilities enable teams to identify the root cause of an outage and put an end to alert storms with built-in dynamic thresholds and root cause analysis.

Contextual insight: AIOps driven forecasting & anomaly detection showing real time intelligence into customer's environment

One-Click-Observability™

Users are able to gain full observability in one click. The LogicMonitor platform can connect the dots between metrics, logs, and traces right out of the box.

Metrics – LogicMonitor’s agentless approach to monitoring metrics makes the discovery and application of monitoring templates very easy. With coverage for more than 2,700 different technologies spanning network, cloud, containers, and applications, LogicMonitor is arguably the most comprehensive monitoring platform in the market.
Logs – By aligning log data to metric data you remove the need to context switch between IT infrastructure monitoring and log management products. LogicMonitor correlates logs with metrics in a single platform industry-standard integrations (or via any custom log source).
Traces – The addition of trace data alongside metrics and logs means you never miss an application error. This insight allows you to improve code quality and diagnose and fix issues faster. With end-to-end tracing for your application, you get insight into the performance of the entire app stack, from code to cloud, to ensure a flawless customer experience in agile environments.

In an industry where consumers are very cost-conscious and increasingly fickle, customer experience is important to keeping your customers happy and also just keeping your customers at full stop. Using technology to reduce downtime within your online marketplace but also to predict trends before they happen will be key to keeping ahead of the competition. Data is needed to make the right decisions and in a complex technology environment, having a complete overview of your critical business systems and business performance could be the differentiating factor that sets your business apart.

Platform

Infrastructure

AIOps & Edwin AI

Cloud & Multi-Cloud

Digital Experience

Logs

Solutions

Business Outcome

Role

Industry

Resources

By Resources

By Topic

Learn the Platform

Company

About Us

Contents

What is monitoring?

What is observability?

How do observability and monitoring work together?

How does observability differ from monitoring?

Observability is a practice, while monitoring is driven by technology

Observability is a superset of monitoring and encompasses many other practices

Observability identifies unknowns, while monitoring aims at reporting known errors

How does monitoring fit into the larger observability ecosystem?

What’s going on in my environment?

Where is the problem happening?

Why is the problem happening?

Does observability encompass monitoring?

What makes monitoring stand on its own?

What are the benefits of monitoring and observability?

Monitoring

Save costs

Reduce risk

Increase productivity

Enhance flexibility

Observability

Eliminate debugging

Monitor health

Build better apps

Automate everything

Improve customer experience

Increase productivity

Getting started with observability and monitoring

Choose a centralized observability platform

Analyze the metrics of your application

Respond to the trends in the statistics

The bottom line

Strategically Invest in Technology

3 Things to Consider as Your Organization Looks Ahead to Its Technology and IT Needs in 2022 and Beyond

The Value of OpenTelemetry and Tracing

Installation

Prerequisites

Initialize the New Project

Run the Sinatra Server

Install the Open Telemetry Client Gems

Initialization

Putting It Together

Run the Application

Exporting Traces

Success! Viewing LogicMonitor Traces

Here is what a Constructed Trace looks like:

Next Steps

External Resources

Installation

Prerequisites:

Initialize the Project:

Next Steps

External Resources

Step 1: Find the Right Monitoring Solution

Step 2: Manage Alerts Before the User Experience Suffers

Step 3: Understand Errors Before It’s Too Late

Step 4: Follow the Dots for Faster Troubleshooting

Achieve Hybrid Observability Powered by AI with LogicMonitor

Contents

What Is Distributed Tracing?

The History of Distributed Tracing

The Relationship Between OpenTelemetry and Distributed Tracing

How Does Distributed Tracing Work?

What Does Distributed Tracing Mean Within Observability?