Informed and Assessable Observability Design Decisions in Cloud-native Microservice Applications

Maria C. Borges; Joshua Bauer; Sebastian Werner; Michael Gebauer; Stefan Tai

TL;DR

The paper tackles the challenge of making cloud-native observability for microservice architectures measurable and comparable. It advances a formal model of the observability design space, introduces fault visibility metrics and composite scores (e.g., $FC_{f,d}$ and $OFO_{d}$), and couples them with a cost metric to weigh observability against overhead. It then presents Oxn, a Chaos-Engineering-inspired, YAML-driven experiment engine that can inject faults and modify observability configurations to evaluate design alternatives on a real-world OpenTelemetry-based demo application. Through a proof-of-concept evaluation, it demonstrates how different instrumentation and sampling configurations affect fault observability and related costs, enabling data-driven trade-offs. The work lays groundwork for systematic, reproducible observability design decisions and potential CI/CD integration for continuous improvement, while outlining avenues for broader platform support and richer fault scenarios.

Abstract

Observability is important to ensure the reliability of microservice applications. These applications are often prone to failures, since they have many independent services deployed on heterogeneous environments. When employed "correctly", observability can help developers identify and troubleshoot faults quickly. However, instrumenting and configuring the observability of a microservice application is not trivial: it is tool-dependent and tied to costs. Architects need to understand observability-related trade-offs in order to weigh different observability design alternatives against each other. Still, these architectural design decisions are not supported by systematic methods and typically rely on "professional intuition". In this paper, we argue for a systematic method to arrive at informed and continuously assessable observability design decisions. Specifically, we focus on fault observability of cloud-native microservice applications, and turn this into a testable and quantifiable property. Towards our goal, we first model the scale and scope of observability design decisions across the cloud-native stack. Then, we propose observability metrics which can be determined for any microservice application through so-called observability experiments. We present a proof-of-concept implementation of our experiment tool OXN. OXN is able to inject arbitrary faults into an application, similar to Chaos Engineering, but also possesses the unique capability to modify the observability configuration, allowing for the assessment of design decisions that were previously left unexplored. We demonstrate our approach using a popular open source microservice application and show the trade-offs involved in different observability design decisions.

Paper Structure (14 sections, 3 equations, 7 figures, 3 tables)

Introduction
Related Work
Modeling Observability Design Decisions
Approach to quantify observability effectiveness
Concept: Fault Observability Metrics
Approach: Observability Experiments
OXN: Observability Experiment Engine
Architecture
Applicability and Exemplary Observability Design Assessment
SUE Setup
Evaluating the Baseline - Results
Evaluating Design Alternatives - Results
Limitations and Future Work
Conclusion

Figures (7)

Figure 1: Model of Observability Design Decisions in Cloud-Native Applications
Figure 2: (A) Visualization of the fault model, used to define the metrics in (B)
Figure 3: System architecture of Oxn
Figure 4: SUE with baseline observability configuration
Figure 5: Experiments run against the SUE using Oxn, showing how different faults appear visually, similar to how a developer would see them in a dashboard. Note how the Pause fault is visible in all metrics. PacketLoss is noticeable in systemCPU but less pronounced in other metrics, with NetworkDelay not being visible at all. In \ref{fig:plots2}, we see how the changes to the observability configuration proposed in \ref{fig:alternatives} affect these metrics.
...and 2 more figures
