OXN -- Automated Observability Assessments for Cloud-Native Applications
Maria C. Borges, Joshua Bauer, Sebastian Werner
TL;DR
The paper addresses the lack of systematic methods for making observability design decisions in cloud-native microservice applications by introducing the Observability eXperiment eNgine (OXN). OXN is an automated, reproducible experiment framework that injects faults and can also reconfigure observability instrumentation, so that design trade-offs can be evaluated empirically. Experiments are defined in YAML and executed via infrastructure-as-code, and the collected data is stored for later analysis in a Jupyter notebook. The key contributions are the OXN architecture, a fault-observability measurement approach, and an extensible treatment library. Together, these let practitioners compare visibility against cost and integrate such experiments into deployment pipelines. The work supports evidence-based observability design and can also support training data generation for anomaly detection.
Abstract
Observability is important to ensure the reliability of microservice applications. These applications are often prone to failures, since they consist of many independent services deployed on heterogeneous environments. When employed "correctly", observability can help developers identify and troubleshoot faults quickly. However, instrumenting and configuring the observability of a microservice application is not trivial: it is tool-dependent and tied to costs. Practitioners need to understand observability-related trade-offs in order to weigh different observability design alternatives against each other. Still, these architectural design decisions are not supported by systematic methods and typically rely on "professional intuition" alone. To assess observability design trade-offs with concrete evidence, we advocate for conducting experiments that compare various design alternatives. Achieving a systematic and repeatable experiment process necessitates automation. We present a proof-of-concept implementation of an experiment tool, the Observability eXperiment eNgine (OXN). OXN is able to inject arbitrary faults into an application, similar to Chaos Engineering, but also possesses the unique capability to modify the observability configuration, allowing for the straightforward assessment of design decisions that were previously left unexplored.
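
To make the idea of comparing design alternatives more concrete, the following is a minimal Python sketch of how such an experiment could be modelled and evaluated. It is not OXN's implementation or its experiment format: the names `Treatment`, `Observation`, `visibility`, `cost`, and `compare` are hypothetical, and both the visibility score (shift of the metric mean under fault, in units of baseline standard deviation) and the cost proxy (number of stored samples) are simplifying assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Treatment:
    """One controlled change applied during an experiment (hypothetical model)."""
    name: str
    kind: str      # assumed to be either "fault" or "observability"
    params: dict


@dataclass
class Observation:
    """Metric samples recorded for one design alternative (illustrative only)."""
    alternative: str
    baseline: list[float]   # samples collected before the fault is injected
    faulted: list[float]    # samples collected while the fault is active


def visibility(obs: Observation) -> float:
    """Shift of the mean under fault, in units of baseline standard deviation."""
    spread = pstdev(obs.baseline) or 1.0  # fall back to 1.0 for a flat baseline
    return abs(mean(obs.faulted) - mean(obs.baseline)) / spread


def cost(obs: Observation) -> int:
    """Crude cost proxy: number of samples that had to be collected and stored."""
    return len(obs.baseline) + len(obs.faulted)


def compare(observations: list[Observation]) -> list[tuple[str, float, int]]:
    """Rank design alternatives by visibility, most visible first."""
    rows = [(o.alternative, round(visibility(o), 2), cost(o)) for o in observations]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Under these assumptions, an experiment would pair one fault treatment, for example `Treatment("delay", "fault", {"ms": 200})`, with several observability treatments such as different sampling intervals, record an `Observation` per alternative, and pass the list to `compare` to see which configuration makes the fault most visible and at what cost.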
