TraceMesh: Scalable and Streaming Sampling for Distributed Traces
Zhuangbin Chen, Zhihan Jiang, Yuxin Su, Michael R. Lyu, Zibin Zheng
TL;DR
TraceMesh reduces the overhead of distributed tracing by biasing sampling toward uncommon traces in a streaming setting. It encodes each trace as a vector in $\mathbb{R}^{|\mathcal{P}|}$, converts that vector into a compact $L$-bit sketch via streaming LSH, and applies evolving clustering (DenStream) to identify and sample rare patterns under a budget $\mathcal{B}$. Key contributions include (i) a trace vector encoding that captures both structural and temporal features and accommodates new call paths on the fly, (ii) streaming trace sketching with StreamHash, which handles unseen features without expanding the input dimensionality, and (iii) a DenStream-based sampling policy that uses potential and outlier micro-clusters (PMCs/OMCs) to prevent over-sampling of recurring traces. On open-source benchmarks and production traces, TraceMesh outperforms the baselines in both sampling accuracy (coverage) and efficiency, which suggests it is practical for scalable observability in cloud-native systems.
Abstract
Distributed tracing serves as a fundamental element in the monitoring of cloud-based and datacenter systems. It provides visibility into the full lifecycle of a request or operation across multiple services, which is essential for understanding system dependencies and performance bottlenecks. To mitigate computational and storage overheads, most tracing frameworks adopt a uniform sampling strategy, which inevitably captures overlapping and redundant information. More advanced methods employ learning-based approaches to bias the sampling toward more informative traces. However, existing methods fall short of considering the high-dimensional and dynamic nature of trace data, which is essential for the production deployment of trace sampling. To address these practical challenges, in this paper we present TraceMesh, a scalable and streaming sampler for distributed traces. TraceMesh employs Locality-Sensitive Hashing (LSH) to improve sampling efficiency by projecting traces into a low-dimensional space while preserving their similarity. In this process, TraceMesh accommodates previously unseen trace features in a unified and streamlined way. Subsequently, TraceMesh samples traces through evolving clustering, which dynamically adjusts the sampling decision to avoid over-sampling of recurring traces. The proposed method is evaluated with trace data collected from both open-source microservice benchmarks and production service systems. Experimental results demonstrate that TraceMesh outperforms state-of-the-art methods by a significant margin in both sampling accuracy and efficiency.
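
To make the sketching step concrete, here is a minimal illustration of how a sparse trace vector might be projected to an $L$-bit sketch without fixing the feature dimension in advance. It is a sketch in the spirit of hash-based sign random projection, not the authors' implementation: the function names (`sketch`, `hamming_similarity`), the choice of BLAKE2b as the hash, the value `L_BITS = 16`, and the call-path strings are all assumptions made for illustration. Because each projection sign is derived by hashing the feature name, a call path that has never been seen before needs no new column in any projection matrix.

```python
import hashlib

L_BITS = 16  # sketch length L; an assumed value for illustration


def _sign(feature: str, bit: int) -> int:
    """Deterministic pseudo-random +1/-1 for a (feature, bit) pair."""
    digest = hashlib.blake2b(f"{bit}:{feature}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1


def sketch(trace: dict, n_bits: int = L_BITS) -> tuple:
    """Project a sparse trace vector {call_path: weight} to n_bits sign bits."""
    bits = []
    for bit in range(n_bits):
        # Each bit is the sign of one hashed random projection of the trace.
        projection = sum(w * _sign(path, bit) for path, w in trace.items())
        bits.append(1 if projection >= 0 else 0)
    return tuple(bits)


def hamming_similarity(a: tuple, b: tuple) -> float:
    """Fraction of matching bits; a proxy for similarity of the original vectors."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical call paths and weights, for illustration only.
common = {"gateway->cart": 3.0, "cart->db": 2.0}
with_new_path = {"gateway->cart": 3.0, "cart->db": 2.0, "cart->promo": 1.0}

s1, s2 = sketch(common), sketch(with_new_path)
print(len(s1), hamming_similarity(s1, s2))
```

The sketch length stays at `L_BITS` regardless of how many distinct call paths appear over time, which is the property the abstract describes as accommodating unseen features in a unified way. A downstream clustering step would then operate on these fixed-length sketches rather than on the raw, growing feature space.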
