Rethinking Telemetry Design for Fine-Grained Anomaly Detection in 5G User Planes

Niloy Saha, Noura Limam, Yang Xiao, Raouf Boutaba

TL;DR

This paper addresses the visibility gap in UPF telemetry between coarse counters and expensive per-packet postcards. The authors present Kestrel, which extends Count-Min Sketch with histogram-augmented buckets and per-queue partitioning to capture latency tails and inter-arrival distributions without maintaining per-flow state. They derive formal detectability guarantees that account for sketch collisions and drift, and provide practical sizing rules (e.g., $w=512$, $d=3$) and binning strategies. Evaluations on a 5G UPF testbed show that Kestrel delivers high detection accuracy with sub-second responsiveness and bounded export cost: roughly 10x lower export bandwidth and about 10% higher detection accuracy than selective postcard schemes.

Abstract

Detecting QoS anomalies in 5G user planes requires fine-grained per-flow visibility, but existing telemetry approaches face a fundamental trade-off. Coarse per-class counters are lightweight but mask transient and per-flow anomalies, while per-packet telemetry postcards provide full visibility at prohibitive cost that grows linearly with line rate. Selective postcard schemes reduce overhead but miss anomalies that fall below configured thresholds or occur during brief intervals. We present Kestrel, a sketch-based telemetry system for 5G user planes that provides fine-grained visibility into key metric distributions such as latency tails and inter-arrival times at a fraction of the cost of per-packet postcards. Kestrel extends Count-Min Sketch with histogram-augmented buckets and per-queue partitioning, which compress per-packet measurements into compact summaries while preserving anomaly-relevant signals. We develop formal detectability guarantees that account for sketch collisions, yielding principled sizing rules and binning strategies that maximize anomaly separability. Our evaluations on a 5G testbed with Intel Tofino switches show that Kestrel achieves 10% better detection accuracy than existing selective postcard schemes while reducing export bandwidth by 10x.

Paper Structure

This paper contains 20 sections, 10 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Anomaly signatures require appropriate monitoring granularity. (a) Congestion produces sustained tail latency on specific QFIs. (b) Policy abuse appears benign in QFI aggregates but is visible at the culprit TEID. (c) Microbursts vanish in second-level averages yet emerge with sub-second resolution. (d) Contention induces distributed oscillations in inter-arrival times across flows.
  • Figure 2: Data structure and operations of Kestrel. Buckets extend CMS to record packet/byte totals, latency histograms, IAT histograms, and policer colors. The example shows an arriving packet (TEID=87, QFI=3, QID=2, latency=25 µs, IAT=18 µs, color=yellow) being hashed to a bucket: totals and color counters are incremented, latency/IAT bins are updated using QID-specific edges, and the last timestamp is refreshed (updates in red).
  • Figure 3: Detection accuracy across telemetry approaches. (a) Single-anomaly scenarios (Scenarios 1–4): Continuous, distributional visibility (Kestrel) outperforms coarse counters and threshold-triggered sampling. (b) Mixed anomalies (Scenario 5): Kestrel sustains high AUPRC and F1, demonstrating robustness to concurrent anomaly types.
  • Figure 4: Pareto of AUPRC vs. telemetry cost. Kestrel's best configuration ($w{=}512,d{=}3$) reaches AUPRC 0.81 at $\sim$6 Mbps, 2$\times$ higher than 3GPP-PM and 10$\times$ cheaper than $\Delta$-SMP.
  • Figure 5: Median time to first detection (left: by scheme with F1 scores; right: by anomaly type).
  • ...and 1 more figure
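
The bucket update described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' data-plane implementation. The class and function names, the bin edges, the hash construction, the 1200-byte packet size, the choice of (TEID, QFI) as the hash key, and the `lookup` rule are all assumptions made for illustration. Only the bucket fields, the example packet values, and the sizing $w=512$, $d=3$ come from the text above.

```python
import bisect
import hashlib
from dataclasses import dataclass, field

# Hypothetical QID-specific bin edges in microseconds (assumed values).
LATENCY_EDGES = {2: [10, 20, 50, 100]}
IAT_EDGES = {2: [5, 10, 20, 50]}
NUM_BINS = 5  # four edges give five bins
COLORS = ("green", "yellow", "red")


@dataclass
class Bucket:
    """One sketch cell: totals, two histograms, policer colors, last timestamp."""
    packets: int = 0
    byte_count: int = 0
    latency_hist: list = field(default_factory=lambda: [0] * NUM_BINS)
    iat_hist: list = field(default_factory=lambda: [0] * NUM_BINS)
    colors: dict = field(default_factory=lambda: dict.fromkeys(COLORS, 0))
    last_ts: float = 0.0


class HistogramSketch:
    """Count-Min-style array of d rows by w histogram-augmented buckets."""

    def __init__(self, width=512, depth=3):
        self.width, self.depth = width, depth
        self.rows = [[Bucket() for _ in range(width)] for _ in range(depth)]

    def _index(self, row, key):
        # One salted hash per row (assumed hash; any independent family would do).
        digest = hashlib.blake2b(
            repr(key).encode(), digest_size=8, salt=bytes([row])
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, teid, qfi, qid, size, latency_us, iat_us, color, ts):
        # Pick bins with the edges configured for this packet's queue (QID).
        lat_bin = bisect.bisect_right(LATENCY_EDGES[qid], latency_us)
        iat_bin = bisect.bisect_right(IAT_EDGES[qid], iat_us)
        for row in range(self.depth):
            b = self.rows[row][self._index(row, (teid, qfi))]
            b.packets += 1
            b.byte_count += size
            b.latency_hist[lat_bin] += 1
            b.iat_hist[iat_bin] += 1
            b.colors[color] += 1
            b.last_ts = ts

    def lookup(self, teid, qfi):
        # CMS-style read (assumption): take the least-loaded of the d buckets,
        # since it has absorbed the fewest colliding packets.
        cells = [self.rows[r][self._index(r, (teid, qfi))] for r in range(self.depth)]
        return min(cells, key=lambda b: b.packets)


# The packet from the Figure 2 caption; size and timestamp are made up.
sketch = HistogramSketch(width=512, depth=3)
sketch.update(teid=87, qfi=3, qid=2, size=1200,
              latency_us=25, iat_us=18, color="yellow", ts=1.0)
cell = sketch.lookup(87, 3)
```

With the assumed edges, the 25 µs latency falls in the [20, 50) bin and the 18 µs inter-arrival time falls in the [10, 20) bin, so each of the three touched buckets gains one count in those bins. Memory stays fixed at $w \times d$ buckets regardless of how many flows are active, which is what lets the sketch avoid per-flow state.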