Anomaly Detection for Scalable Task Grouping in Reinforcement Learning-based RAN Optimization
Jimmy Li, Igor Kozlov, Di Wu, Xue Liu, Gregory Dudek
TL;DR
The paper tackles scaling reinforcement learning (RL) for cellular radio access network (RAN) optimization across thousands of sites by introducing a policy bank that reuses existing policies for compatible tasks and trains new ones only for incompatible tasks. Compatibility is assessed with anomaly-detection (AD) methods (BG and TREX-DINO), framed as change-point detection on time-series interaction data, and distillation is used to cap the bank size. Results on a system-level simulator show that the AD-based methods achieve high performance-to-training ratios $\xi$ and cumulative rewards $\rho$ comparable to baselines that train a policy for every task, with TREX-DINO (TD) being robust to threshold settings. The framework enables practical deployment of multitask RL controllers in real-world RANs and can be extended to other RAN optimization problems such as energy saving.
Abstract
The use of learning-based methods for optimizing cellular radio access networks (RAN) has received increasing attention in recent years. This coincides with a rapid increase in the number of cell sites worldwide, driven largely by dramatic growth in cellular network traffic. Training and maintaining learned models that work well across a large number of cell sites has thus become a pertinent problem. This paper proposes a scalable framework for constructing a reinforcement learning policy bank that can perform RAN optimization across a large number of cell sites with varying traffic patterns. Central to our framework is a novel application of anomaly detection techniques to assess the compatibility between sites (tasks) and the policy bank. This allows our framework to intelligently identify when a policy can be reused for a task, and when a new policy needs to be trained and added to the policy bank. Our results show that our approach to compatibility assessment leads to an efficient use of computational resources: it allows us to construct a performant policy bank without exhaustively training on all tasks, making the approach applicable under real-world constraints.
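The reuse-or-train decision described in the abstract can be illustrated with a minimal sketch. This is not the paper's method: the anomaly score below is a simple mean-shift z-score standing in for the BG and TREX-DINO detectors, and `Policy`, `anomaly_score`, `build_policy_bank`, and the example site data are all hypothetical names and values chosen for illustration. The sketch also treats each task's time series as fixed, whereas in the framework the series would come from a policy interacting with the task, and it omits the distillation step that caps the bank size.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Dict, List, Tuple


@dataclass
class Policy:
    """Hypothetical policy record; a real entry would hold trained RL weights."""
    name: str
    reference: List[float]  # series observed on the task the policy was trained on


def anomaly_score(reference: List[float], observed: List[float]) -> float:
    """Stand-in detector (assumption): mean shift in units of reference std dev."""
    sigma = pstdev(reference) or 1e-9  # avoid division by zero for flat series
    return abs(mean(observed) - mean(reference)) / sigma


def build_policy_bank(
    tasks: Dict[str, List[float]], threshold: float
) -> Tuple[List[Policy], Dict[str, str]]:
    """Reuse a policy when the task looks non-anomalous, else 'train' a new one."""
    bank: List[Policy] = []
    assignment: Dict[str, str] = {}
    for task_id, series in tasks.items():
        scores = [anomaly_score(p.reference, series) for p in bank]
        if scores and min(scores) <= threshold:
            # Compatible: reuse the policy with the lowest anomaly score.
            assignment[task_id] = bank[scores.index(min(scores))].name
        else:
            # Incompatible with every policy: training is stubbed out here.
            policy = Policy(name=f"policy_{len(bank)}", reference=list(series))
            bank.append(policy)
            assignment[task_id] = policy.name
    return bank, assignment


# Illustrative data only: two similar sites and one with a different traffic level.
tasks = {
    "site_a": [1.0, 1.1, 0.9, 1.0],
    "site_b": [1.05, 0.95, 1.0, 1.0],
    "site_c": [5.0, 5.2, 4.8, 5.0],
}
bank, assignment = build_policy_bank(tasks, threshold=3.0)
print(len(bank), assignment)
```

In this toy run, `site_b` reuses the policy created for `site_a`, while `site_c` triggers a new policy, so only two of the three tasks require training. The threshold plays the role of the sensitivity setting whose robustness the TL;DR discusses.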
