How to Use Quarkus With Micrometer Metrics to Monitor Microservice Pipeline

At LogicMonitor, we deal primarily with large quantities of time series data. Our backend infrastructure processes billions of metrics, events, and configurations daily.

In previous blogs, we discussed our transition from monolith to microservice. We also explained why we chose Quarkus as our microservices framework for our Java-based microservices.

In this blog we will cover:

  • What Quarkus and Micrometer metrics are
  • The two major KPIs of our metrics processing pipeline
  • How to use Micrometer metrics
  • The LogicMonitor microservice technology stack
  • How we correlate configuration changes to metrics
  • How we track anomalies

What Are Quarkus and Micrometer Metrics?

Quarkus is a Kubernetes-native Java stack tailored for OpenJDK HotSpot and GraalVM. It is crafted from best-of-breed Java libraries and standards.

Micrometer is a metrics instrumentation library for JVM-based applications. It provides a vendor-neutral, code-facing Java API that is agnostic to any specific vendor or collection mechanism, and it is designed to simplify metrics collection from an application while maximizing portability.

Quarkus provides an extension to streamline Micrometer integration for JVM and custom metrics collection.

// gradle dependency for the Quarkus Micrometer extension
implementation 'io.quarkus:quarkus-micrometer:1.11.0.Final'
// gradle dependency for an in-memory registry designed to operate on a pull model
implementation 'io.micrometer:micrometer-registry-prometheus:1.6.3'

What Are the Two Major KPIs of Our Metrics Processing Pipeline?

For our metrics processing pipeline, our two major KPIs (Key Performance Indicators) are the number of processed messages and the latency of the whole pipeline across multiple microservices.

We are interested in the number of processed messages over time in order to detect anomalies in the expected workload of the application. Our workload varies over time but normally follows predictable patterns. This allows us not only to detect greater-than-expected load and react accordingly, but also to proactively detect potential data collection issues.
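
One way to capture this first KPI with Micrometer is a Counter that is incremented for every message a service processes. The sketch below is illustrative rather than our production code; the class name, meter name, and method are assumptions:

import javax.enterprise.context.ApplicationScoped;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;

@ApplicationScoped
public class ProcessedMessageCounter {

    private final Counter processedMessages;

    // Quarkus injects the MeterRegistry provided by the Micrometer extension
    public ProcessedMessageCounter(MeterRegistry registry) {
        // "processedMessages" is an illustrative meter name
        processedMessages = Counter.builder("processedMessages")
            .description("Number of messages processed by this service")
            .register(registry);
    }

    public void onMessage(String message) {
        // Process the message, then count it
        processedMessages.increment();
    }
}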

In addition to the data volume, we are interested in the pipeline latency. This metric is measured for every message, from its first ingestion time until it is fully processed, and includes the time spent in transit in Kafka clusters between our different microservices. It allows us to monitor the health of the pipeline as a whole, in conjunction with microservice-specific metrics. Because we monitor the total processing duration for each message, we can report and alert not only on the average processing time but also on different percentile values like p50, p95, and p999. This can help detect when one or more nodes in a microservice along the pipeline are unhealthy: the average processing duration across all messages might not change much, but the high percentiles (p99, p999) will increase, indicating a localized issue.

In addition to our KPIs, Micrometer exposes JVM metrics that can be used for general application monitoring, such as memory usage, CPU usage, garbage collection, and much more.

How to Use Micrometer Metrics

In order to use Micrometer within the context of Quarkus, two dependencies are required. The quarkus-micrometer dependency provides the interfaces and other classes needed to instrument the code. In addition, a dependency on the desired registry is necessary, and there are multiple registries to choose from. micrometer-registry-prometheus is an in-memory registry that, combined with Quarkus, lets us expose metrics easily through a REST endpoint. Starting with Quarkus 1.11.0.Final, those two dependencies are combined into a single extension.

The Quarkus Micrometer extension supports very convenient annotations to count and time method invocations (@Counted and @Timed). This, however, is limited to methods within a single microservice.

@Timed(
   value = "processMessage",
   description = "How long it takes to process a message"
)
public void processMessage(String message) {
   // Process the message
}
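
@Counted works the same way for counting method invocations. Here is a minimal, illustrative sketch using the standard Micrometer annotation; the class name and metric name are assumptions, not taken from our services:

import javax.enterprise.context.ApplicationScoped;

import io.micrometer.core.annotation.Counted;

@ApplicationScoped
public class MessageHandler {

    @Counted(
       value = "processMessage.count",
       description = "How many messages have been processed"
    )
    public void processMessage(String message) {
       // Process the message
    }
}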

It is also possible to programmatically create and provide values for Timer metrics. This is helpful when you want to instrument a duration, but want to provide individual measurements. We are using this method to track the KPIs for our microservice pipeline. We attach the ingestion timestamp as a Kafka header to each message and can track the time spent throughout the pipeline.

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.TimeUnit;

import javax.enterprise.context.ApplicationScoped;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.header.Header;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

@ApplicationScoped
public class Processor {

   private MeterRegistry registry;
   private Timer timer;

   // Quarkus injects the MeterRegistry
   public Processor(MeterRegistry registry) {
       this.registry = registry;
       timer = Timer.builder("pipelineLatency")
           .description("The latency of the whole pipeline.")
           .publishPercentiles(0.5, 0.75, 0.95, 0.98, 0.99, 0.999)
           .percentilePrecision(3)
           .distributionStatisticExpiry(Duration.ofMinutes(5))
           .register(registry);
   }

   public void processMessage(ConsumerRecord<String, String> message) {
       /*
           Do message processing
        */
       // Retrieve the kafka header
       Optional.ofNullable(message.headers().lastHeader("pipelineIngestionTimestamp"))
           // Get the value of the header
           .map(Header::value)
           // Read the bytes as String
           .map(v -> new String(v, StandardCharsets.UTF_8))
           // Parse as long epoch in millisecond
           .map(v -> {
               try {
                   return Long.parseLong(v);
               } catch (NumberFormatException e) {
                   // The header can't be parsed as a Long
                   return null;
               }
           })
           // Calculate the duration between the start and now
           // If there is a discrepancy in the clocks the calculated
            // duration might be less than 0. Those will be dropped by Micrometer
           .map(t -> System.currentTimeMillis() - t)
           .ifPresent(d -> timer.record(d, TimeUnit.MILLISECONDS));
   }
}
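
For reference, the ingestion side simply stamps each record with the current time before producing it to Kafka. The following is a minimal sketch assuming a plain KafkaProducer; the class name and topic handling are illustrative, and only the header name matches the consumer code above:

import java.nio.charset.StandardCharsets;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class IngestionProducer {

    private final KafkaProducer<String, String> producer;

    public IngestionProducer(KafkaProducer<String, String> producer) {
        this.producer = producer;
    }

    public void send(String topic, String key, String payload) {
        ProducerRecord<String, String> record = new ProducerRecord<>(topic, key, payload);
        // Stamp the record with the ingestion time so downstream services
        // can compute the total pipeline latency from this header
        record.headers().add("pipelineIngestionTimestamp",
            Long.toString(System.currentTimeMillis()).getBytes(StandardCharsets.UTF_8));
        producer.send(record);
    }
}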

The timer metric, along with its aggregations, can then be retrieved via the REST endpoint at https://quarkusHostname/metrics:

# HELP pipelineLatency_seconds The latency of the whole pipeline.
# TYPE pipelineLatency_seconds summary
pipelineLatency_seconds{quantile="0.5",} 0.271055872
pipelineLatency_seconds{quantile="0.75",} 0.386137088
pipelineLatency_seconds{quantile="0.95",} 0.483130368
pipelineLatency_seconds{quantile="0.98",} 0.48915968
pipelineLatency_seconds{quantile="0.99",} 0.494140416
pipelineLatency_seconds{quantile="0.999",} 0.498072576
pipelineLatency_seconds_count 168.0
pipelineLatency_seconds_sum 42.581
# HELP pipelineLatency_seconds_max The latency of the whole pipeline.
# TYPE pipelineLatency_seconds_max gauge
pipelineLatency_seconds_max 0.498

We then ingest those metrics into LogicMonitor as DataPoints using collectors.

LogicMonitor Microservice Technology Stack  

LogicMonitor’s Metric Pipeline, for which we built out multiple microservices with Quarkus, is deployed in our environment on the following technology stack:

  • Java 11 (Amazon Corretto, chosen for licensing reasons)
  • Kafka (managed in AWS MSK) 
  • Kubernetes 
  • Nginx (ingress controller within Kubernetes)

Kubernetes node shown in LogicMonitor.

How Do We Correlate Configuration Changes to Metrics?

Once those metrics are ingested into LogicMonitor, they can be displayed as graphs or integrated into dashboards. They can also be used for alerting and anomaly detection, and, in conjunction with ops-notes, they can be visualized in relation to infrastructure or configuration changes, as well as other significant events.

Below is an example of an increase in processing duration correlated to the deployment of a new version. The deployment of a new version automatically triggers an ops-note that can then be displayed on graphs and dashboards. In this example, this functionality facilitates the correlation between latency increase and service deployment.

An increase in processing duration correlated with the deployment of a new version.

How to Track Anomalies

All of our microservices are monitored with LogicMonitor. Here is an example of anomaly detection for the 95th percentile of the pipeline latency. LogicMonitor dynamically figures out the normal operating values and creates a band of expected values. It is then possible to define alerts for when values fall outside the generated band.

An example of anomaly detection for the 95th percentile of the pipeline latency in LogicMonitor.

As seen above, the integration of Micrometer with Quarkus, in conjunction with LogicMonitor, gives us a straightforward and quick way to add visibility into our microservices. This ensures that our processing pipeline provides the most value to our clients while minimizing the monitoring effort for our engineers, reducing cost, and increasing productivity.