Evaluating Asynchronous Semantics in Trace-Discovered Resilience Models: A Case Study on the OpenTelemetry Demo
Authors
Anatoly A. Krasnovsky
Abstract
While distributed tracing and chaos engineering are becoming standard for microservices, resilience models remain largely manual and bespoke. We revisit a trace-discovered connectivity model that derives a service dependency graph from traces and uses Monte Carlo simulation to estimate endpoint availability under fail-stop service failures. Compared to earlier work, we (i) derive the graph directly from raw OpenTelemetry traces, (ii) attach endpoint-specific success predicates, and (iii) add a simple asynchronous semantics that treats Kafka edges as non-blocking for immediate HTTP success. We apply this model to the OpenTelemetry Demo ("Astronomy Shop") using a GitHub Actions workflow that discovers the graph, runs simulations, and executes chaos experiments that randomly kill microservices in a Docker Compose deployment. Across the studied failure fractions, the model reproduces the overall availability degradation curve, while asynchronous semantics for Kafka edges change predicted availabilities by at most about 10^(-5) (0.001 percentage points). This null result suggests that for immediate HTTP availability in this case study, explicitly modeling asynchronous dependencies is not warranted, and a simpler connectivity-only model is sufficient.