An Online Probabilistic Distributed Tracing System

M. Toslali, S. Qasim, S. Parthasarathy, F. A. Oliveira, H. Huang, G. Stringhini, Z. Liu, A. K. Coskun

TL;DR

This work tackles the cost-utility tension in distributed tracing by introducing Astraea, an online probabilistic tracing system that uses Bayesian online learning and bandit-based sampling to identify and monitor only the spans most informative for diagnosing performance variations. By maintaining low-dimensional Beta beliefs per span and applying a percentile-based threshold with approximate Bayesian sampling, Astraea adaptively shifts instrumentation toward vital spans, achieving high diagnostic accuracy while dramatically reducing trace overhead. Empirical evaluation across three cloud applications and production traces shows Astraea localizes faults with about 92% top-5 accuracy using only ~25% of spans, and operates with sub-100 ms inference, indicating strong scalability and practicality for large production systems. The work demonstrates a concrete, online approach to automate instrumentation control, reducing overhead without sacrificing diagnostic power and offering actionable outputs like span utilities, rankings, and correlation tags to aid developers.
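
The idea of per-span Beta beliefs combined with a percentile threshold can be illustrated with a small sketch. This is not Astraea's implementation: it assumes a plain Thompson-style draw from each Beta belief, a reward in [0, 1] supplied by some external analysis, and hypothetical names (`SpanBelief`, `choose_enabled_spans`, the span labels) chosen only for illustration.

```python
import random


class SpanBelief:
    """Beta(alpha, beta) belief about a span's usefulness (hypothetical structure)."""

    def __init__(self):
        # Uniform prior: nothing is known about the span yet.
        self.alpha = 1.0
        self.beta = 1.0

    def update(self, reward):
        # Assumption: reward lies in [0, 1]; fractional updates keep a Beta form.
        self.alpha += reward
        self.beta += 1.0 - reward

    def mean(self):
        return self.alpha / (self.alpha + self.beta)

    def sample(self, rng):
        return rng.betavariate(self.alpha, self.beta)


def choose_enabled_spans(beliefs, percentile, rng):
    """Draw once from every belief and keep spans at or above the percentile of draws."""
    draws = {name: belief.sample(rng) for name, belief in beliefs.items()}
    ordered = sorted(draws.values())
    index = min(len(ordered) - 1, int(percentile / 100.0 * len(ordered)))
    threshold = ordered[index]
    return {name for name, value in draws.items() if value >= threshold}


if __name__ == "__main__":
    rng = random.Random(0)
    beliefs = {name: SpanBelief() for name in ("db_query", "auth", "cache", "render")}
    for _ in range(50):
        # Hypothetical rewards: only db_query explains latency variation.
        for name, belief in beliefs.items():
            belief.update(0.9 if name == "db_query" else 0.1)
    print(sorted(choose_enabled_spans(beliefs, 75, rng)))
```

Sampling from the belief rather than using its mean lets rarely observed spans occasionally cross the threshold, which is how a bandit-style scheme keeps exploring while spending most of its instrumentation budget on spans that have proven informative.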

Abstract

Distributed tracing has become a fundamental tool for diagnosing performance issues in the cloud by recording causally ordered, end-to-end workflows of request executions. However, tracing in production workloads can introduce significant overheads due to the extensive instrumentation needed for identifying performance variations. This paper addresses the trade-off between the cost of tracing and the utility of the "spans" within that trace through Astraea, an online probabilistic distributed tracing system. Astraea is based on our technique that combines online Bayesian learning and multi-armed bandit frameworks. This formulation enables Astraea to effectively steer tracing towards the useful instrumentation needed for accurate performance diagnosis. Astraea localizes performance variations using only 10-28% of available instrumentation, markedly reducing tracing overhead, storage, compute costs, and trace analysis time.

Paper Structure (16 sections, 1 equation, 11 figures, 1 table)

Figures (11)

  • Figure 1: The practicality of tracing is limited by (a) overhead and (b) a large portion of extraneous instrumentation. (a) Tracing overheads can increase end-to-end request latency. Tracing in this experiment is conducted with a 100% sampling rate to emphasize the overhead. (b) The majority of spans in a trace are not needed to explain performance variation.
  • Figure 2: An overview of the distributed tracing architecture. The bottom shows a simplified version of the Social network application, instrumented with tracing. The top shows the tracing backend. Tracing overheads are demonstrated with circled numbers.
  • Figure 3: A simplified trace from Social network. Analysis on $\mu$ and $\sigma$ of span latency helps localize a performance issue.
  • Figure 4: Astraea design. The bottom displays an instrumented application, and the top features Astraea components guiding tracing to rewarding spans.
  • Figure 5: The latency contribution of a leaf span (B, C, and D) corresponds to its processing duration. In the case of non-leaf spans, Astraea isolates the self_segment, representing the duration when an operation is not awaiting the completion of a child operation.
  • ...and 6 more figures
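
The self_segment notion in the Figure 5 caption (time a span spends not waiting on a child) can be sketched as an interval computation. This is a minimal illustration under stated assumptions, not the paper's code: the function name `self_time`, the timestamp units, and the choice to merge overlapping child intervals (so that concurrent children are not double-counted) are all assumptions made here.

```python
def self_time(span_start, span_end, child_intervals):
    """Return the time within [span_start, span_end] not covered by any child.

    child_intervals is an iterable of (start, end) pairs. Overlapping children
    are merged first (an assumption about how concurrency should be handled).
    """
    # Clip children to the parent window and drop empty intervals.
    clipped = sorted(
        (max(start, span_start), min(end, span_end))
        for start, end in child_intervals
        if min(end, span_end) > max(start, span_start)
    )

    covered = 0.0
    current_start, current_end = None, None
    for start, end in clipped:
        if current_end is None or start > current_end:
            # A gap begins: close the previous merged interval, if any.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start

    return (span_end - span_start) - covered


if __name__ == "__main__":
    # Parent A runs 0..100 ms with children at 10..30, 20..50 and 70..90 ms.
    print(self_time(0, 100, [(10, 30), (20, 50), (70, 90)]))
    # A leaf span has no children, so its self time is its full duration.
    print(self_time(0, 50, []))
```

For a leaf span the result is simply the span's duration, matching the caption's statement that a leaf's latency contribution corresponds to its processing time.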