Dynamic SLA-aware Network Slice Monitoring

Niloy Saha, Mina Tahmasbi Arashloo, Nashid Shahriar, Raouf Boutaba

TL;DR

The paper tackles real-time SLA-aware monitoring for large-scale network slicing under telemetry budget constraints by formulating monitoring as a closed-loop control problem. It introduces the Telemetry Primitive Contract (TPC) and presents SliceScope, a practical system using change-triggered In-Band Network Telemetry (INT) to dynamically allocate monitoring resources across slices and SLA metrics. Through a joint multi-slice optimization, epoch-based adaptation, and per-slice tunable thresholds, SliceScope achieves up to 4x more accurate tracking for critical slices and outperforms alternative telemetry primitives in end-to-end SLA tracking. The evaluation spans large-scale simulations and a hardware testbed on Intel Tofino, validating the approach and its deployment considerations such as bounded header size and path-aware state management. Overall, the work demonstrates a viable, adaptive framework for SLA-aware telemetry in programmable networks and outlines directions for future enhancements and broader integration.

Abstract

Next-generation networks increasingly rely on network slices - logical networks tailored to specific application requirements, each with distinct Service-Level Agreements (SLAs). Ensuring compliance with these SLAs requires continuous, real-time monitoring of end-to-end performance metrics for each slice, within a limited telemetry budget. However, we find that existing solutions face two fundamental limitations: they either lack end-to-end visibility (e.g., sketches, probabilistic sampling) or provide visibility but lack the control mechanisms to dynamically allocate monitoring resources according to slice SLAs. We address this through a formal framework that reframes slice monitoring as a closed-loop control problem and defines the minimal data plane requirements for SLA-aware slice monitoring via a telemetry primitive contract. We then present SliceScope, a realization of this framework that combines: (1) a control strategy that dynamically allocates the monitoring resources across diverse slices according to their SLA criticality, and (2) a data plane based on change-triggered INT that provides per-packet end-to-end visibility with tunable accuracy-overhead trade-offs, satisfying the telemetry contract. Our evaluation results on programmable switches and in large-scale simulations with a mixture of different slice types demonstrate that SliceScope tracks critical slices up to 4x more accurately compared to static baselines, while showing that change-triggered INT outperforms alternative primitives for realizing the telemetry primitive contract.

Paper Structure

This paper contains 18 sections, 3 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Closed-loop formulation of SLA-aware slice monitoring. The control plane continuously learns from telemetry observations that reflect slice-level performance, estimates monitoring error $E(k_{s,m})$ and overhead $\Gamma(k_{s,m})$, and tunes per-slice knobs $k_{s,m}$ of the data plane telemetry primitive to satisfy SLA objectives under resource constraints.
  • Figure 2: Analytical framework for change-driven telemetry insertion. (a) Accuracy–overhead trade-off as a function of the telemetry threshold $\Delta_{s,m}$: increasing $\Delta_{s,m}$ reduces telemetry overhead (blue) but increases monitoring error (red). (b) Example trace of the simulated cumulative metric difference $D(t,\Delta_{s,m})$, which mimics how $\delta_{s,m}$ accumulates in the data plane and triggers telemetry insertion whenever it exceeds $\Delta_{s,m}$. (c) Probability density of $D(t,\Delta_{s,m})$, where the shaded region corresponds to the insertion probability $\beta(\Delta_{s,m})$.
  • Figure 3: Per-slice telemetry state stored in each bucket $A_{ij}$ of the bucket arrays. Each bucket maintains the slice key $x$, e2e estimates ($E_{prev},E_{rep},E_{last}$), auxiliary state $V_{aux}$ (e.g., counters for packet loss), and a miss flag for recovery.
  • Figure 4: Telemetry header format. The fixed part (always present) includes a shim and per-hop metadata. The conditional part (inserted only when $\delta_{s,m} >\Delta_{s,m}$) carries updated e2e metric values $E_{curr}$ and auxiliary state $V_{aux}$ for computing metrics such as jitter or packet loss.
  • Figure 5: Pareto frontiers comparing per-packet overhead and tolerance violations across different workload mixes (SP, BAL, and LP). SliceScope "moves" monitoring resources from less critical slices to more critical ones as needed. For Static Slice-Agnostic, points represent varying $\Delta$ values (1-20), for Static Slice-Aware, different combinations of $\Delta$ values for (URLLC, eMBB, mMTC) slices, and for SliceScope, different values of tuning parameter $\lambda$.
  • ...and 5 more figures
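
The captions of Figures 2 and 4 describe change-triggered insertion: an updated end-to-end value is carried only when the change $\delta_{s,m}$ exceeds the per-slice threshold $\Delta_{s,m}$. The following is a minimal Python sketch of that trade-off, not the paper's data plane implementation. It assumes that $\delta$ is the absolute difference between the current value and the last reported value; the function name `replay` and the sample latencies are hypothetical and chosen only for illustration.

```python
from typing import Sequence, Tuple


def replay(values: Sequence[float], threshold: float) -> Tuple[float, float]:
    """Replay a metric trace through a change-triggered reporter.

    Assumption (not from the paper): a report is emitted when the current
    value differs from the last reported value by more than `threshold`.
    Returns (insertion_rate, worst_tracking_error), where the tracking
    error is measured against a collector that holds the last report.
    """
    if not values:
        return 0.0, 0.0

    last_reported = None
    inserted = 0
    worst_error = 0.0
    for v in values:
        # The first packet always reports; later ones only on a large change.
        if last_reported is None or abs(v - last_reported) > threshold:
            last_reported = v
            inserted += 1
        # Error seen by the collector, which only knows the last report.
        worst_error = max(worst_error, abs(v - last_reported))
    return inserted / len(values), worst_error


# Hypothetical per-packet latency samples (ms) for one slice.
latencies = [10.0, 10.5, 11.0, 14.0, 14.2, 9.0, 9.1, 9.3]

for delta in (0.0, 2.0):
    rate, err = replay(latencies, delta)
    print(f"threshold={delta}: insertion rate={rate:.3f}, worst error={err:.1f}")
```

Under this assumption the tracking error never exceeds the threshold, so raising the threshold lowers the insertion rate at the cost of a looser error bound, which is the trade-off shown in Figure 2(a).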