
Causal Intervention Sequence Analysis for Fault Tracking in Radio Access Networks

Chenhua Shi, Joji Philip, Subhadip Bandyopadhyay, Jayanta Choudhury

TL;DR

The paper addresses SLA breach fault tracking in RAN using millisecond-scale telemetry, where traditional coarse-grained approaches fail to reveal both root-cause indicators and their causal order. It introduces a three-component pipeline—Root-Cause Discovery (RCD), causal subgraph analysis, and deviation detection—to identify intervention sequences leading to SLA violations, leveraging KS tests and Z-scores for temporal ordering. Monte Carlo simulations show that the approach yields convergent estimates of causal-source probabilities and identifies reliable KPIs, while the method remains CPU-friendly and scalable for edge deployments. Overall, the framework enables proactive fault prevention by delivering high-resolution, causally ordered insights that are directly actionable for network operators.
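The summary above names KS tests and Z-scores but does not spell out how they are combined, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each KPI has a baseline window of normal samples, uses a hand-written two-sample KS statistic to measure how far the abnormal distribution sits from the normal one, and flags each sample as 0 (no anomaly), -1 (below threshold), or 1 (above threshold) by Z-score. KPIs are then ordered by the first non-zero flag. The function names, the KPI names, and the threshold of 3.0 are all hypothetical.

```python
from statistics import mean, stdev


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x so ties are handled together.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def deviation_flags(series, baseline, z_thresh=3.0):
    """Flag each sample: 1 above threshold, -1 below threshold, 0 otherwise."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline must have non-zero variance")
    flags = []
    for x in series:
        z = (x - mu) / sigma
        flags.append(1 if z > z_thresh else -1 if z < -z_thresh else 0)
    return flags


def first_onset(flags):
    """Index of the first anomalous sample, or None if the series never deviates."""
    for t, flag in enumerate(flags):
        if flag != 0:
            return t
    return None


def order_by_onset(kpi_series, kpi_baselines, z_thresh=3.0):
    """Return KPI names sorted by the time index at which each first deviates."""
    onsets = {}
    for name, series in kpi_series.items():
        t = first_onset(deviation_flags(series, kpi_baselines[name], z_thresh))
        if t is not None:
            onsets[name] = t
    return sorted(onsets, key=onsets.get)


if __name__ == "__main__":
    # Hypothetical KPI names and toy values, for illustration only.
    baseline = [10, 11, 9, 10, 10, 11, 9, 10]
    series = {"prb_util": [10, 10, 20, 20], "throughput": [10, 10, 10, 0]}
    baselines = {"prb_util": baseline, "throughput": baseline}
    print(order_by_onset(series, baselines))
    print(ks_statistic(baseline, series["prb_util"]))
```

In this sketch the KS statistic only ranks how strongly a KPI's distribution shifts, while the onset index supplies the ordering; a production version would likely use a library routine such as SciPy's two-sample KS test to obtain p-values as well.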

Abstract

To keep modern Radio Access Networks (RAN) running smoothly, operators need to spot the real-world triggers behind Service-Level Agreement (SLA) breaches well before customers feel them. We introduce an AI/ML pipeline that does two things most tools miss: (1) finds the likely root-cause indicators and (2) reveals the exact order in which those events unfold. We start by labeling network data: records linked to past SLA breaches are marked 'abnormal', and everything else 'normal'. Our model then learns the causal chain that turns normal behavior into a fault. In Monte Carlo tests the approach pinpoints the correct trigger sequence with high precision and scales to millions of data points without loss of speed. These results show that high-resolution, causally ordered insights can move fault management from reactive troubleshooting to proactive prevention.

Paper Structure

This paper contains 15 sections, 4 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: PCMCI on Cell Load Issue with Normal State
  • Figure 2: PCMCI on Cell Load Issue with Abnormal State
  • Figure 3: Causally relevant indicators are shown in yellow and the SLA breach in red. Dashed red arrows mark the preceding intervention events, with STEP numbers ascending in the order in which the anomaly events appear.
  • Figure 4: Histograms of the normal state (blue) versus the abnormal state (orange), which confirm the results of the method.
  • Figure 5: The left panel shows the raw data changing from normal (light green) to abnormal (purple). The right panel shows the time-delay shift in the onset of anomaly events, taken from the output of the univariate deviation detection: 0 indicates no anomaly, -1 a decrease below the threshold, and 1 an increase above the threshold.
  • ...and 2 more figures