Visualizing Distributed Traces in Aggregate
Adrita Samanta, Henry Han, Darby Huye, Lan Liu, Zhaoqi Zhang, Raja R. Sambasivan
TL;DR
This paper tackles the challenge of analyzing large-scale distributed traces by proposing an aggregation approach that clusters similar traces and visualizes group-level variations. It defines trace similarity using Jaccard-like encodings and employs Disjoint Set Union to form trace groups, from which a representative trace is chosen and an aggregate trace graph is constructed to capture group-wide relationships and deviations. The method includes preprocessing to remove incomplete traces, multiple encoding definitions (notably service-set and exact-structure), and a threshold-based graph construction with an optimization procedure to approach a target number of groups. Visualizations are developed to depict groups and selected services, and the approach is evaluated on synthetic datasets, reporting both effectiveness and performance metrics along with threshold-finding results. Overall, the work advances scalable trace analysis by providing a practical pipeline for grouping, representing, and visualizing trace datasets to aid debugging and system optimization.
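The grouping step described above can be illustrated with a minimal sketch. It assumes the service-set encoding (each trace is reduced to the set of service names it touches), plain Jaccard similarity, and a bisection search over the threshold. The names `jaccard`, `DisjointSetUnion`, `group_traces`, and `find_threshold` are hypothetical, and the bisection stands in for the paper's optimization procedure, whose details are not given here.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class DisjointSetUnion:
    """Union-find with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def group_traces(encodings, threshold):
    """Join traces whose similarity meets the threshold; return index groups."""
    dsu = DisjointSetUnion(len(encodings))
    for i, j in combinations(range(len(encodings)), 2):
        if jaccard(encodings[i], encodings[j]) >= threshold:
            dsu.union(i, j)
    groups = {}
    for i in range(len(encodings)):
        groups.setdefault(dsu.find(i), []).append(i)
    return sorted(groups.values())


def find_threshold(encodings, target, steps=20):
    """Bisection (an assumption, not the paper's procedure): a higher
    threshold removes edges, so the group count never decreases with it."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if len(group_traces(encodings, mid)) < target:
            lo = mid  # too few groups: raise the threshold
        else:
            hi = mid  # enough groups: try a lower threshold
    return hi


# Hypothetical service-set encodings for four traces.
traces = [
    {"frontend", "cart", "db"},
    {"frontend", "cart", "db", "cache"},
    {"frontend", "search"},
    {"frontend", "search", "index"},
]
groups_loose = group_traces(traces, 0.6)
groups_strict = group_traces(traces, 0.7)
threshold_for_three = find_threshold(traces, target=3)
```

With a threshold of 0.6 the four traces fall into two groups; raising it to 0.7 splits the two search traces apart, because their similarity is only 2/3. The pairwise comparison is quadratic in the number of traces, which is acceptable for a sketch but would need pruning on a large dataset.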
Abstract
Distributed systems are composed of many components that communicate with one another to form an application. Distributed tracing gives us visibility into these complex interactions, but it can be difficult to reason about the system's behavior, even with traces. Systems collect large amounts of tracing data even with low sampling rates. Even when there are patterns in the system, it is often difficult to detect similarities in traces because current tools mainly allow developers to visualize individual traces. Debugging and system optimization are difficult for developers without an understanding of the whole trace dataset. To help surface these similarities, this paper proposes a method that aggregates traces by grouping similar traces together and visualizing them. We do so by selecting a few traces that are representative of each group. We suggest that traces can be grouped based on how many services they share, how many levels their graphs have, how structurally similar they are, or how close their latencies are. We also develop an aggregate trace data structure as a way to comprehensively visualize these groups, and a method for filtering out incomplete traces when a more complete version of the same trace exists. The unique traces of each group are especially useful to developers for troubleshooting. Overall, our approach enables more efficient analysis of system behavior.
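The filtering of incomplete traces mentioned above could look roughly like the sketch below. It rests on an assumption not stated in the abstract: each trace is modeled as a set of (caller, callee) edges, and a trace counts as incomplete when its edge set is a proper subset of another trace's edge set. The function name `drop_incomplete` and the sample data are hypothetical.

```python
def drop_incomplete(traces):
    """Keep only traces that are not a proper subset of another trace.

    Each trace is a frozenset of (caller, callee) edges. Identical traces
    are all kept, since equal sets are not proper subsets of each other.
    """
    return [
        trace
        for trace in traces
        if not any(trace < other for other in traces)  # `<` is proper subset
    ]


# Hypothetical data: the first trace is missing the cart -> db call.
sample = [
    frozenset({("frontend", "cart")}),
    frozenset({("frontend", "cart"), ("cart", "db")}),
    frozenset({("frontend", "search")}),
]
kept = drop_incomplete(sample)
```

Here the first trace is dropped because the second contains all of its edges plus one more, while the search trace survives because no other trace covers it.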
