UniSage: A Unified and Post-Analysis-Aware Sampling for Microservices

Zhouruixing Zhu, Zhihan Jiang, Tianyi Yang, Pinjia He

TL;DR

UniSage tackles the observability data deluge in microservices by reversing the traditional sampling order: it first runs lightweight anomaly detection and root cause analysis (RCA) on the full multi-modal data, then uses the results to guide a dual-pillar sampling strategy that preserves both anomaly-relevant and edge-case signals. By fusing traces, logs, and metrics and propagating cross-modal evidence, UniSage covers more of the critical data at low sampling budgets while remaining efficient enough for production use. Experiments on TrainTicket and OnlineBoutique show substantial gains in sampling quality and in downstream RCA accuracy (up to a 42.45% improvement in AC@1), and the pipeline processes 10 minutes of telemetry in under 5 seconds. Its interpretable sampling decisions and unified handling of multiple observability signals make it practical for production diagnostics and fault isolation in large-scale microservice deployments.

Abstract

Traces and logs are essential for observability and fault diagnosis in modern distributed systems. However, their ever-growing volume introduces substantial storage overhead and complicates troubleshooting. Existing approaches typically adopt a sample-before-analysis paradigm: even when guided by data heuristics, they inevitably discard failure-related information and hinder transparency in diagnosing system behavior. To address this, we introduce UniSage, the first unified framework to sample both traces and logs using a post-analysis-aware paradigm. Instead of discarding data upfront, UniSage first performs lightweight, multi-modal anomaly detection and root cause analysis (RCA) on the complete data stream. This process yields fine-grained, service-level diagnostic insights that guide a dual-pillar sampling strategy for handling both normal and anomalous scenarios: an analysis-guided sampler prioritizes data implicated by RCA, while an edge-case-based sampler ensures rare but critical behaviors are captured. Together, these pillars ensure comprehensive coverage of critical signals without excessive redundancy. Extensive experiments demonstrate that UniSage significantly outperforms state-of-the-art baselines. At a 2.5% sampling rate, it captures 56.5% of critical traces and 96.25% of relevant logs, while improving the accuracy (AC@1) of downstream root cause analysis by 42.45%. Furthermore, its efficient pipeline processes 10 minutes of telemetry data in under 5 seconds, demonstrating its practicality for production environments.
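
To make the dual-pillar idea concrete, the following is a minimal sketch of how a fixed sampling budget might be split between an analysis-guided pillar and an edge-case pillar. It is not the paper's algorithm: the `Trace` fields, the `dual_pillar_sample` function, the `edge_share` split, and the use of call-path frequency as a rarity signal are all illustrative assumptions, and the per-service suspicion scores are simply taken as given from an upstream RCA step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    trace_id: str
    services: tuple   # services the request passed through
    path_count: int   # assumed rarity signal: how often this call path occurred


def dual_pillar_sample(traces, suspect_scores, budget, edge_share=0.5):
    """Keep at most `budget` traces (hypothetical sketch, not UniSage itself).

    suspect_scores maps service name -> suspicion score from an upstream
    RCA step; an empty dict means no anomaly was detected in this window.
    """
    # With no anomaly, the whole budget goes to the edge-case pillar.
    n_edge = int(budget * edge_share) if suspect_scores else budget
    n_guided = budget - n_edge

    def suspicion(trace):
        # A trace is as suspicious as its most suspicious service.
        return max((suspect_scores.get(s, 0.0) for s in trace.services),
                   default=0.0)

    # Pillar 1 (analysis-guided): traces touching RCA-implicated services,
    # most suspicious first.
    implicated = [t for t in traces if suspicion(t) > 0]
    guided = sorted(implicated, key=suspicion, reverse=True)[:n_guided]
    chosen = {t.trace_id for t in guided}

    # Pillar 2 (edge-case): rarest call paths first; it also absorbs any
    # budget the guided pillar could not use.
    rest = sorted((t for t in traces if t.trace_id not in chosen),
                  key=lambda t: (t.path_count, t.trace_id))
    edge = rest[:budget - len(guided)]
    return guided + edge


traces = [
    Trace("t1", ("frontend", "cart"), 100),
    Trace("t2", ("frontend", "payment"), 80),
    Trace("t3", ("frontend", "payment"), 80),
    Trace("t4", ("frontend", "admin"), 1),
    Trace("t5", ("frontend", "cart"), 100),
]
kept = dual_pillar_sample(traces, {"payment": 0.9}, budget=2)
print([t.trace_id for t in kept])
```

In this toy window, one slot goes to a trace through the suspected `payment` service and the other to the rare `admin` path, which mirrors the stated goal of covering both RCA-implicated data and rare behaviors within one budget.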

Paper Structure

This paper contains 42 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: A real example of the end-to-end execution of a user request in a microservice system and its observability data.
  • Figure 2: Two sampling paradigms: Prior work vs. UniSage.
  • Figure 3: The comparison of siloed and unified sampling.
  • Figure 4: Overview of UniSage.
  • Figure 5: Dual-Pillar Sampling Strategy of UniSage.
  • ...and 2 more figures