TraceMesh: Scalable and Streaming Sampling for Distributed Traces

Zhuangbin Chen; Zhihan Jiang; Yuxin Su; Michael R. Lyu; Zibin Zheng

TL;DR

TraceMesh reduces the overhead of distributed tracing by biasing sampling toward uncommon traces in a streaming setting. It encodes each trace as a vector in $\mathbb{R}^{|\mathcal{P}|}$, converts the vector into a compact $L$-bit sketch via streaming LSH, and applies evolving clustering (DenStream) to identify and sample rare patterns under a budget $\mathcal{B}$. Key contributions include (i) a trace vector encoding that captures both structural and temporal features and accommodates new call paths on the fly, (ii) streaming trace sketching with StreamHash, which handles unseen features without expanding the input dimensionality, and (iii) a DenStream-based sampling policy that uses potential and outlier micro-clusters (PMCs/OMCs) to avoid over-sampling recurring traces. In the reported experiments, TraceMesh outperforms the baselines in both sampling accuracy (coverage) and efficiency on open-source benchmarks and production traces, which suggests practical value for scalable observability in cloud-native systems.
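The sketching step summarized above can be illustrated with a small hash-based sign projection. This is a minimal sketch in the spirit of streaming LSH, not the authors' implementation: the `StreamingSketcher` class, the `sign_for` helper, the dictionary encoding of a trace (call-path string mapped to a numeric value), the example call paths, and the choice of 16 bits are all assumptions made for illustration. The point it shows is that a feature's projection signs are derived from a hash of its name, so a previously unseen call path needs no pre-allocated dimension and the sketch length stays fixed at $L$.

```python
import hashlib

L_BITS = 16  # sketch length L; 16 is an illustrative assumption


def sign_for(feature: str, bit: int) -> int:
    """Derive a deterministic +1/-1 sign for a (feature, bit) pair from a hash."""
    digest = hashlib.blake2b(f"{bit}:{feature}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1


class StreamingSketcher:
    """Hypothetical helper: maps a sparse trace vector to an L-bit sketch."""

    def __init__(self, n_bits: int = L_BITS):
        self.n_bits = n_bits

    def project(self, trace_vector: dict) -> list:
        # Each bit is a signed sum over the features present in the trace.
        # No global feature index is kept, so new call paths are handled as-is.
        proj = [0.0] * self.n_bits
        for path, value in trace_vector.items():
            for b in range(self.n_bits):
                proj[b] += value * sign_for(path, b)
        return proj

    def sketch(self, trace_vector: dict) -> tuple:
        # Keep only the sign of each projection, giving an L-bit sketch.
        return tuple(1 if p >= 0 else 0 for p in self.project(trace_vector))


def hamming(a, b) -> int:
    """Number of differing bits between two equal-length sketches."""
    return sum(x != y for x, y in zip(a, b))


# Usage: the second trace contains a call path the sketcher has never seen.
sketcher = StreamingSketcher()
seen = {"gateway->cart": 12.0, "cart->db": 3.5}
unseen = {"gateway->cart": 12.0, "cart->db": 3.5, "cart->cache": 0.8}
print(len(sketcher.sketch(seen)), len(sketcher.sketch(unseen)))
print(hamming(sketcher.sketch(seen), sketcher.sketch(unseen)))
```

Deriving signs from a hash rather than storing a projection matrix trades a little computation for constant memory, which is what lets the input feature space grow while the sketch size does not.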

Abstract

Distributed tracing serves as a fundamental element in the monitoring of cloud-based and datacenter systems. It provides visibility into the full lifecycle of a request or operation across multiple services, which is essential for understanding system dependencies and performance bottlenecks. To mitigate computational and storage overheads, most tracing frameworks adopt a uniform sampling strategy, which inevitably captures overlapping and redundant information. More advanced methods employ learning-based approaches to bias the sampling toward more informative traces. However, existing methods fall short of considering the high-dimensional and dynamic nature of trace data, which is essential for the production deployment of trace sampling. To address these practical challenges, in this paper we present TraceMesh, a scalable and streaming sampler for distributed traces. TraceMesh employs Locality-Sensitive Hashing (LSH) to improve sampling efficiency by projecting traces into a low-dimensional space while preserving their similarity. In this process, TraceMesh accommodates previously unseen trace features in a unified and streamlined way. Subsequently, TraceMesh samples traces through evolving clustering, which dynamically adjusts the sampling decision to avoid over-sampling of recurring traces. The proposed method is evaluated with trace data collected from both open-source microservice benchmarks and production service systems. Experimental results demonstrate that TraceMesh outperforms state-of-the-art methods by a significant margin in both sampling accuracy and efficiency.
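To make the idea of adjusting the sampling decision for recurring traces concrete, the following is a deliberately simplified sketch of clustering-driven sampling over bit sketches. It is not DenStream and not the paper's sampling policy: the `RaritySampler` and `MicroCluster` classes, the Hamming `radius`, the `decay_rate`, the `promote_weight` threshold, and the rule "always sample below the threshold, otherwise sample with probability threshold/weight" are all assumptions chosen for illustration, and the budget $\mathcal{B}$ is not modeled. It borrows only two ideas from the text: cluster weights fade over time, and clusters that have absorbed many traces are sampled less often.

```python
import random


def hamming(a, b) -> int:
    """Number of differing bits between two equal-length sketches."""
    return sum(x != y for x, y in zip(a, b))


class MicroCluster:
    """A decaying cluster of similar sketches, summarized by one center."""

    def __init__(self, center, now):
        self.center = center
        self.weight = 1.0
        self.last_update = now

    def absorb(self, now, decay_rate):
        # Fade the old weight exponentially, then count the new trace.
        self.weight *= 2.0 ** (-decay_rate * (now - self.last_update))
        self.weight += 1.0
        self.last_update = now


class RaritySampler:
    """Hypothetical sampler: rare patterns are kept, recurring ones thinned."""

    def __init__(self, radius=2, decay_rate=0.01, promote_weight=3.0, seed=0):
        self.radius = radius                  # max Hamming distance to join a cluster
        self.decay_rate = decay_rate          # how fast old traces are forgotten
        self.promote_weight = promote_weight  # weight at which a cluster counts as common
        self.clusters = []
        self.rng = random.Random(seed)

    def sampling_probability(self, sketch, now) -> float:
        near = [c for c in self.clusters if hamming(c.center, sketch) <= self.radius]
        if not near:
            # Unseen pattern: open a new (outlier-like) cluster and keep the trace.
            self.clusters.append(MicroCluster(sketch, now))
            return 1.0
        cluster = min(near, key=lambda c: hamming(c.center, sketch))
        cluster.absorb(now, self.decay_rate)
        if cluster.weight < self.promote_weight:
            return 1.0  # still rare, so keep sampling it
        return self.promote_weight / cluster.weight  # common: thin it out

    def should_sample(self, sketch, now) -> bool:
        return self.rng.random() < self.sampling_probability(sketch, now)
```

Because weights decay, a pattern that stops appearing gradually regains a high sampling probability, so the sampler adapts as the workload drifts rather than fixing its notion of "common" once.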

Paper Structure (20 sections, 6 equations, 6 figures, 2 tables)

Introduction
Background
Distributed Traces and Their Sampling
Problem Statement
Methodology
Trace Vector Encoding
Trace Sketching by Hashing
Trace Sampling by Evolving Clustering
Complexity Analysis
Evaluation
Experimental Settings
Datasets
Evaluation Metrics
Baseline Methods
Experimental Results
...and 5 more sections

Figures (6)

Figure 1: The overall framework of TraceMesh
Figure 2: Trace vector encoding
Figure 3: Streaming trace vector encoding
Figure 4: Coverage with different sampling budgets
Figure 5: Efficiency on the Industry dataset
...and 1 more figures
