How to Keep Observability Alive in Microservice Landscapes through OpenTelemetry

Christian Fröhlingsdorf
March 27, 2024

Observability has become a cornerstone of system reliability and efficiency in modern software engineering and operations. Beyond its traditional scope of logging, monitoring, and tracing, observability can also be defined through the lens of incident response efficiency: specifically, by the time it takes a team to grasp the full context and background of a technical incident.

Optimizing Time to Understanding

This view introduces Time to Understanding (TTU), a metric that links the other incident management metrics, Time to Acknowledge (TTA) and Time to Resolve (TTR). TTU measures the duration from when an alert is received to when the team fully comprehends the incident's scope, impact, and underlying causes. It covers the gap between acknowledging an alert (TTA) and starting effective resolution work toward TTR, and it also plays a role in refining alert management strategies. By reducing TTU, organizations can strengthen operational resilience, minimize downtime, and shorten their path to incident resolution. It is important to understand, however, that how to optimize TTU depends heavily on the underlying infrastructure and architecture of the software product being maintained.
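To make the relationship between the three metrics concrete, here is a minimal sketch in Python. The `IncidentTimeline` class, its field names, and the timestamps are hypothetical, and the sketch assumes that all three durations are measured from the moment the alert is received; teams may define the start and end points differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Hypothetical record of the key timestamps of one incident."""
    alert_received: datetime
    acknowledged: datetime
    understood: datetime  # team has grasped scope, impact, and causes
    resolved: datetime

    def tta(self) -> timedelta:
        # Time to Acknowledge: alert received -> someone acknowledges it
        return self.acknowledged - self.alert_received

    def ttu(self) -> timedelta:
        # Time to Understanding: alert received -> incident is understood
        return self.understood - self.alert_received

    def ttr(self) -> timedelta:
        # Time to Resolve: alert received -> incident is resolved
        return self.resolved - self.alert_received


# Made-up example values for illustration only.
incident = IncidentTimeline(
    alert_received=datetime(2024, 3, 27, 10, 0),
    acknowledged=datetime(2024, 3, 27, 10, 4),
    understood=datetime(2024, 3, 27, 10, 35),
    resolved=datetime(2024, 3, 27, 11, 0),
)

print("TTA:", incident.tta())  # TTA: 0:04:00
print("TTU:", incident.ttu())  # TTU: 0:35:00
print("TTR:", incident.ttr())  # TTR: 1:00:00
```

In this made-up timeline, 31 of the 60 minutes pass between acknowledging the alert and understanding the incident, which is the portion that better observability aims to shrink.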

The Impact of Microservice Architectures on Time to Understanding

Software teams often favor monolithic architectures for their simplicity: the application is built as a single, unified unit. This approach makes understanding the system's functionality and debugging issues more straightforward, as all components operate within a single codebase and runtime environment. However, as development teams and application complexity grow, the limitations of monolithic architectures, such as scalability and deployment speed, push organizations towards microservice architectures. Microservices, which break the application down into smaller, independently deployable services, offer greater flexibility and scalability. Yet this fragmentation makes the system harder to understand: the interdependencies and interactions across numerous services can obscure the overall picture, making it challenging, almost impossible, for even large development teams to grasp how everything works together.

Where OpenTelemetry Comes into Play

Microservice architectures can significantly hinder observability, turning the management and troubleshooting of services into a daunting task. OpenTelemetry is designed to address these challenges by providing a unified and standardized framework for collecting, processing, and shipping telemetry data (metrics, logs, and traces) from each microservice. By implementing OpenTelemetry, organizations can gain a comprehensive view of their microservice landscape, enabling them to track the flow of requests across service boundaries, understand the interactions and dependencies among disparate services, and identify performance bottlenecks or failure points with precision. This level of observability cuts through the chaotic nature of microservice architectures and supports a deeper understanding of system behavior and operational dynamics.
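Tracking a request across service boundaries relies on context propagation. The sketch below illustrates the idea using the W3C Trace Context `traceparent` header format (`version-traceid-parentid-flags`), which OpenTelemetry supports. The helper functions are hypothetical and simplified; in practice the OpenTelemetry SDKs handle propagation, and this is not their implementation.

```python
import re
import secrets

# traceparent: 2-hex version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace (hypothetical helper; sampled flag set to 01)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(incoming: str) -> str:
    """Build the header for an outgoing call made while handling `incoming`.

    The trace ID is kept so all spans of the request stay in one trace;
    only the span ID changes. Invalid input starts a fresh trace.
    """
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()            # created by the first service
outgoing = child_traceparent(incoming)  # forwarded to the next service
print(incoming.split("-")[1] == outgoing.split("-")[1])  # True
```

Because every service forwards the same trace ID, a backend can later stitch the spans from different services into a single request timeline.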

The Three Pillars of OpenTelemetry

Metrics, the first pillar of OpenTelemetry, represent a crucial component for monitoring and understanding system performance at scale. They are designed to be lightweight and easy to store, even at high volume, making them ideal for capturing a high-level overview of system health and behavior over time. By aggregating numerical data points—such as request counts, error rates, and resource usage—metrics provide a simplified, yet comprehensive, snapshot of the operational state. However, this process of aggregation, while beneficial for scalability and manageability, can inadvertently obscure detailed information about infrequent or outlier events, concealing potential issues within the system.
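A small worked example shows how aggregation can hide an outlier. The latency values below are made up for illustration: 99 fast requests and one very slow one.

```python
import statistics

# Hypothetical request latencies in milliseconds: 99 fast requests, 1 slow one.
latencies_ms = [20] * 99 + [5000]

mean_ms = statistics.mean(latencies_ms)      # (99 * 20 + 5000) / 100
median_ms = statistics.median(latencies_ms)  # middle of the sorted values
max_ms = max(latencies_ms)

print(f"mean={mean_ms:.1f} ms")    # mean=69.8 ms
print(f"median={median_ms:.1f} ms")  # median=20.0 ms
print(f"max={max_ms} ms")          # max=5000 ms
```

A dashboard that shows only the mean or the median would suggest a healthy service, while one user waited five seconds. Details of that single request are what logs and traces are meant to preserve.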

Logs, the second pillar, are still considered "young" in the OpenTelemetry Protocol (OTLP) framework, yet they play a critical role in diagnosing and understanding issues within microservice architectures. Logs offer a very detailed explanation of problems, capturing events in a structured or unstructured format that developers can analyze to pinpoint the root causes of issues. However, this utility comes with challenges: because of the potentially high volume of logs generated, especially in complex and distributed systems, their storage and management can become difficult. These challenges demand efficient log aggregation and management solutions so that logs remain accessible and useful for troubleshooting without overwhelming the system's resources.

Tracing, the third pillar of OpenTelemetry, serves as a hybrid between metrics and logs. It offers a rich and detailed view into the system's behavior by capturing very dense information, even more than traditional logs. A trace encapsulates the journey of a single request through the system, decomposed into multiple spans, where each span represents a distinct "unit of work" or operation within the service architecture. Together, these spans form a detailed timeline of the request's path, pinpointing where time is spent and where errors occur. Despite the wealth of data that traces provide, it is noteworthy that the vast majority (99.9%) of this data is never actively viewed, which underscores how selectively tracing data is consumed.
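The structure of a trace can be sketched as a simple data model. The `Span` class, its field names, and the example values below are illustrative assumptions and not the OTLP schema; the sketch also assumes that child spans run one after another without overlapping.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """Simplified span: one unit of work within a request (illustrative only)."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str
    start_ms: int
    end_ms: int
    error: bool = False

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_time_ms(span: Span, trace: List[Span]) -> int:
    """Time spent in the span itself, excluding its direct children."""
    children = [s for s in trace if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


# One request, decomposed into four spans across hypothetical services.
trace = [
    Span("a", None, "GET /checkout", 0, 480),
    Span("b", "a", "cart-service: load cart", 10, 60),
    Span("c", "a", "payment-service: charge", 70, 450, error=True),
    Span("d", "c", "db: insert payment", 80, 120),
]

slowest = max(trace, key=lambda s: self_time_ms(s, trace))
failed = [s.name for s in trace if s.error]

print(slowest.name, self_time_ms(slowest, trace))  # payment-service: charge 340
print(failed)  # ['payment-service: charge']
```

The parent references turn a flat list of spans into a tree, which is what lets a tracing backend show both where the time went and which operation failed.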

The Heart of OpenTelemetry: The Collector

The three pillars above are published from within application code by OpenTelemetry instruments, that is, the libraries and SDKs that help developers emit OTLP data. Next to these instruments, the OpenTelemetry Collector is the heart of the framework.

The Collector (OTelC) is more than a data exporter; it adds capabilities that are critical for managing telemetry data efficiently in distributed systems. It can manage cardinality by reducing dimensionality where necessary, which keeps data usable and prevents monitoring systems from being overwhelmed. OTelC instances can also be chained, providing a scalable way to preprocess telemetry data (filtering, sampling, and processing) before it reaches the backend. By controlling what data is transmitted, including the removal of noisy, sensitive, or otherwise unnecessary information, OTelC ensures that only pertinent, high-quality data is forwarded, which helps both performance and compliance.
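The preprocessing steps described above (filtering, removing sensitive data, and sampling) can be pictured as a chain of stages. The following Python sketch only illustrates that idea; it is not how the Collector is implemented or configured, and the record layout, the `/healthz` route, the attribute keys, and all function names are assumptions.

```python
import hashlib
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, object]  # hypothetical flat telemetry record

# Attribute keys treated as sensitive in this sketch (illustrative choice).
SENSITIVE_KEYS = {"user.email", "http.request.header.authorization"}


def drop_noise(records: Iterable[Record]) -> Iterator[Record]:
    """Filter: discard health-check traffic that rarely helps debugging."""
    for record in records:
        if record.get("http.route") != "/healthz":
            yield record


def redact(records: Iterable[Record]) -> Iterator[Record]:
    """Remove sensitive attributes before data leaves the local network."""
    for record in records:
        yield {k: v for k, v in record.items() if k not in SENSITIVE_KEYS}


def sample(records: Iterable[Record], keep_percent: int) -> Iterator[Record]:
    """Deterministic sampling: the same trace ID always gets the same decision."""
    for record in records:
        digest = hashlib.sha256(str(record["trace_id"]).encode()).digest()
        if int.from_bytes(digest[:8], "big") % 100 < keep_percent:
            yield record


def pipeline(records: Iterable[Record], keep_percent: int = 100) -> List[Record]:
    """Chain the stages, similar in spirit to processors in a collector pipeline."""
    return list(sample(redact(drop_noise(records)), keep_percent))


records = [
    {"trace_id": "t1", "http.route": "/healthz"},
    {"trace_id": "t2", "http.route": "/checkout", "user.email": "a@example.com"},
]
print(pipeline(records))  # [{'trace_id': 't2', 'http.route': '/checkout'}]
```

Hashing the trace ID rather than drawing a random number means every stage that sees a span of the same trace makes the same keep-or-drop decision, so sampled traces stay complete.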

Crucially, with OTelC positioned close to the data sources, it dramatically decreases the amount of traffic required to travel over the network, which is especially beneficial in cloud environments where data transfer costs can accumulate. This proximity allows for efficient traffic management and load reduction, ensuring high-volume telemetry data does not saturate network resources.

Moreover, OpenTelemetry instruments (SDKs, libraries) are relieved from the burden of traffic and load considerations, allowing developers to focus on instrumentation without worrying about the impact on data transfer volumes. With OTelC, managing OTLP cardinality and enhancing data efficiency becomes seamless, negating the need for invasive changes within the application code itself. Thus, code integrity is preserved while comprehensive observability is ensured.

This separation also fits well in a microservice environment, where development teams usually take care of the deployment and runtime of their own services. They can use their own OTel Collector pipelines to fine-tune their OTLP data streams without having to alter and redeploy their services.
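Reducing cardinality outside the application can be illustrated with a small sketch: a high-cardinality label is dropped from a set of counter series, and the series that become identical are summed. The label names, values, and the `drop_label` helper are hypothetical and only demonstrate the principle, not a Collector API.

```python
from collections import Counter
from typing import Dict, FrozenSet, Tuple

LabelSet = FrozenSet[Tuple[str, str]]


def drop_label(series: Dict[LabelSet, int], label: str) -> Dict[LabelSet, int]:
    """Remove one label and merge the counter series that become identical."""
    reduced: Counter = Counter()
    for labels, value in series.items():
        kept = frozenset((k, v) for k, v in labels if k != label)
        reduced[kept] += value
    return dict(reduced)


# Request counters keyed by label set; user_id makes the series count explode.
series = {
    frozenset({("route", "/checkout"), ("user_id", "1")}): 3,
    frozenset({("route", "/checkout"), ("user_id", "2")}): 5,
    frozenset({("route", "/cart"), ("user_id", "1")}): 2,
}

reduced = drop_label(series, "user_id")
print(len(series), "->", len(reduced))  # 3 -> 2
print(reduced[frozenset({("route", "/checkout")})])  # 8
```

The application keeps emitting the detailed label; the reduction happens downstream, so changing it requires no code change or redeployment of the service.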

What Standalone OpenTelemetry is Missing

While OpenTelemetry excels in collecting and exporting telemetry data in a distributed service environment, it does not include functionalities for storing this data; instead, it relies on external storage solutions to archive and manage the collected information.

Additionally, OpenTelemetry itself does not provide dashboards, which are required for visualizing data trends and insights. Instead, it requires integration with other tools to analyze and display the data.

Notifications, essential for alerting teams to system issues in real time, are also beyond the scope of OpenTelemetry's capabilities, necessitating supplementary alerting mechanisms.

Finally, OpenTelemetry does not natively support proactive testing of applications through simulated traffic or user interactions, an important aspect of understanding and ensuring system performance and reliability under various conditions.

Consequently, while OpenTelemetry is a powerful tool for observability, it must be complemented with additional systems and strategies to cover these critical areas fully.

A Top-of-the-line Observability Stack with OpenTelemetry and ilert

To address OpenTelemetry’s limitations and thereby create a top-tier observability stack, its capabilities can be significantly enhanced by integrating it with specialized tools.

For data storage and visually intuitive dashboards that aid rapid data analysis, Honeycomb complements OpenTelemetry by offering scalable, high-powered analytics. A tool like Checkly, which specializes in proactive testing and validation of web services, closes the loop on comprehensive system monitoring. To amplify the effectiveness of alerting, ilert can be integrated with Honeycomb and Checkly, ensuring that notifications are timely and actionable and that they escalate through the right channels, such as voice calls or Microsoft Teams channel updates.
