Visualizing Distributed Traces in Aggregate

Adrita Samanta, Henry Han, Darby Huye, Lan Liu, Zhaoqi Zhang, Raja R. Sambasivan

TL;DR

This paper tackles the challenge of analyzing large-scale distributed traces by proposing an aggregation approach that clusters similar traces and visualizes group-level variations. It defines trace similarity using Jaccard-like encodings and employs Disjoint Set Union to form trace groups, from which a representative trace is chosen and an aggregate trace graph is constructed to capture group-wide relationships and deviations. The method includes preprocessing to remove incomplete traces, multiple encoding definitions (notably service-set and exact-structure), and a threshold-based graph construction with an optimization procedure to approach a target number of groups. Visualizations are developed to depict groups and selected services, and the approach is evaluated on synthetic datasets, reporting both effectiveness and performance metrics along with threshold-finding results. Overall, the work advances scalable trace analysis by providing a practical pipeline for grouping, representing, and visualizing trace datasets to aid debugging and system optimization.
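The grouping step summarized above can be illustrated with a minimal sketch. It assumes the service-set encoding, plain Jaccard similarity, and an all-pairs comparison; the names `jaccard`, `DisjointSetUnion`, `group_traces`, and `find_threshold` are hypothetical and not taken from the paper. The threshold search uses a simple binary search, which is one possible stand-in for the paper's optimization procedure rather than a description of it.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class DisjointSetUnion:
    """Union-find with path halving and union by size."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]


def group_traces(service_sets: list, threshold: float) -> list:
    """Link every pair whose similarity meets the threshold, then return
    the connected components as sorted lists of trace indices."""
    dsu = DisjointSetUnion(len(service_sets))
    for i, j in combinations(range(len(service_sets)), 2):
        if jaccard(service_sets[i], service_sets[j]) >= threshold:
            dsu.union(i, j)
    groups = {}
    for i in range(len(service_sets)):
        groups.setdefault(dsu.find(i), []).append(i)
    return sorted(groups.values())


def find_threshold(service_sets: list, target: int, iterations: int = 30) -> float:
    """Assumed binary search: a higher threshold removes links, so the group
    count never decreases as the threshold grows. Returns a threshold whose
    group count is at least `target`, as low as the search can find."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if len(group_traces(service_sets, mid)) < target:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical service sets for four traces.
traces = [
    {"frontend", "auth", "cart"},
    {"frontend", "auth", "cart", "payment"},
    {"frontend", "search"},
    {"frontend", "search", "cache"},
]
groups = group_traces(traces, threshold=0.6)
print(groups)
```

With a threshold of 0.6, the first two traces (similarity 0.75) and the last two (similarity about 0.67) each form a group. A representative trace per group could then be chosen by any rule, for example the member with the highest average similarity to the rest; the rule actually used by the paper is not specified in this summary.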

Abstract

Distributed systems comprise many components that communicate to form an application. Distributed tracing gives us visibility into these complex interactions, but it can be difficult to reason about the system's behavior, even with traces. Systems collect large amounts of tracing data even with low sampling rates. Even when there are patterns in the system, it is often difficult to detect similarities in traces, since current tools mainly allow developers to visualize individual traces. Debugging and system optimization are difficult for developers without an understanding of the whole trace dataset. To help present these similarities, this paper proposes a method that groups similar traces together and visualizes them in aggregate. We do so by selecting a few traces that are representative of each group. We suggest that traces can be grouped based on how many services they share, how many levels the graph has, how structurally similar they are, or how close their latencies are. We also develop an aggregate trace data structure as a way to comprehensively visualize these groups, and a method for filtering out incomplete traces if a more complete version of the trace exists. The unique traces of each group are especially useful to developers for troubleshooting. Overall, our approach allows for a more efficient method of analyzing system behavior.

Paper Structure

This paper contains 35 sections, 14 figures.

Figures (14)

  • Figure 1: Tprof Aggregate trace of 3 traces.
  • Figure 2: Design Diagram. Flowchart displaying the major steps of our method.
  • Figure 3: Preprocessing traces. Example of two traces, where trace 1 is the incomplete version of trace 2. The edges marked in red are the edges that appear in both traces 1 and 2. Note: all edges in trace 1 are marked in red.
  • Figure 4: Preprocessing design diagram. Flowchart displaying the major steps of our preprocessing method, which is discussed in the preprocessing section.
  • Figure 5: Example traces 1 and 2 with service names.
  • ...and 9 more figures
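
The preprocessing idea in the Figure 3 caption, where every edge of the incomplete trace also appears in the more complete one, can be sketched as follows. This is a minimal illustration under an assumed representation: each trace is a set of (caller, callee) edges, and a trace is dropped when its edge set is a proper subset of another trace's edge set. The function name `drop_incomplete` and the sample data are hypothetical, and the paper's actual matching criteria may differ.

```python
def drop_incomplete(traces: dict) -> dict:
    """Keep only traces whose edge set is not a proper subset of another's.

    `traces` maps a trace id to a set of (caller, callee) edges.
    Identical traces are all kept, because `<` tests for a proper subset.
    """
    kept = {}
    for trace_id, edges in traces.items():
        subsumed = any(
            edges < other_edges
            for other_id, other_edges in traces.items()
            if other_id != trace_id
        )
        if not subsumed:
            kept[trace_id] = edges
    return kept


# Hypothetical data: trace1 is an incomplete version of trace2.
sample = {
    "trace1": {("A", "B"), ("B", "C")},
    "trace2": {("A", "B"), ("B", "C"), ("B", "D")},
    "trace3": {("A", "E")},
}
kept = drop_incomplete(sample)
print(sorted(kept))
```

The pairwise comparison is quadratic in the number of traces, which is acceptable for a sketch but would likely need indexing or bucketing on a large dataset.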