Generalization in Reinforcement Learning for Radio Access Networks

Burak Demirel, Yu Wang, Cristian Tatino, Pablo Soldati

TL;DR

This work tackles RL generalization in dynamic, heterogeneous radio access networks by proposing a generalization-centered framework that combines robust state reconstruction, graph-based network representations, domain randomization, and distributed learning. It introduces a distributed RL architecture and a link-adaptation case study, showing that policies trained under diverse simulated conditions outperform traditional OLLA baselines by about $10\%$ in full-buffer (FB) scenarios and by $>20\%$ under high mobility, while graph-based encodings (notably GAT) yield up to $30\%$ higher throughput in larger deployments. The results demonstrate zero-shot or near-zero-shot generalization across unseen radio conditions, suggesting a scalable path toward an AI-native 6G RAN with a single generalizable RL agent. Practically, the approach enables sim2sim and field-data integration, reducing retraining needs and offering robust, interoperable AI components for O-RAN-standardized architectures.

Abstract

Modern radio access networks (RANs) operate in highly dynamic and heterogeneous environments, where hand-tuned, rule-based radio resource management (RRM) algorithms often underperform. While reinforcement learning (RL) can surpass such heuristics in constrained settings, the diversity of deployments and unpredictable radio conditions introduce major generalization challenges. Data-driven policies frequently overfit to training conditions, degrading performance in unseen scenarios. To address this, we propose a generalization-centered RL framework for RAN control that: (i) robustly reconstructs dynamically varying states from partial and noisy observations, while encoding static and semi-static information, such as radio nodes, cell attributes, and their topology, through graph representations; (ii) applies domain randomization to broaden the training distribution; and (iii) distributes data generation across multiple actors while centralizing training in a cloud-compatible architecture aligned with O-RAN principles. Although generalization increases computational and data-management complexity, our distributed design mitigates this by scaling data collection and training across diverse network conditions. Applied to downlink link adaptation in five 5G benchmarks, our policy improves average throughput and spectral efficiency by ~10% over an outer-loop link adaptation (OLLA) baseline (10% BLER target) in full-buffer MIMO/mMIMO and by >20% under high mobility. It matches specialized RL in full-buffer traffic and achieves up to 4- and 2-fold gains in eMBB and mixed-traffic benchmarks, respectively. In nine-cell deployments, graph attention network (GAT) models offer 30% higher throughput than MLP baselines. These results, combined with our scalable architecture, offer a path toward AI-native 6G RAN using a single, generalizable RL agent.
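For readers unfamiliar with the OLLA baseline named in the abstract, the following is a minimal sketch of a common textbook form of outer-loop link adaptation. It is not necessarily the baseline configuration used in the paper. The class name, the 1 dB step size, and the sign convention (offset added to the reported SINR) are assumptions made for illustration. Only the 10% BLER target comes from the abstract.

```python
from dataclasses import dataclass


@dataclass
class OuterLoopLinkAdaptation:
    """Textbook-style OLLA: an SINR offset driven by HARQ ACK/NACK feedback."""

    bler_target: float = 0.10   # 10% BLER target, as stated in the abstract
    step_down_db: float = 1.0   # assumed step size; not taken from the paper
    offset_db: float = 0.0      # running correction applied to reported SINR

    @property
    def step_up_db(self) -> float:
        # Choosing step_up / step_down = target / (1 - target) makes the
        # offset stationary exactly when the observed BLER equals the target.
        return self.step_down_db * self.bler_target / (1.0 - self.bler_target)

    def update(self, ack: bool) -> float:
        """Raise the offset slightly on ACK, lower it sharply on NACK."""
        if ack:
            self.offset_db += self.step_up_db
        else:
            self.offset_db -= self.step_down_db
        return self.offset_db

    def effective_sinr_db(self, reported_sinr_db: float) -> float:
        """SINR estimate that would drive the modulation-and-coding choice."""
        return reported_sinr_db + self.offset_db
```

With these step sizes, nine ACKs followed by one NACK return the offset to its starting value, which is the sense in which the loop settles at a 10% block error rate.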

Paper Structure

This paper contains 37 sections, 8 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: UE-centric representation of multi-cell relationships in cellular networks. (a) Radio environment snapshot as observed by a UE, reporting reference signal measurements from its serving cell (green) and its eight strongest neighboring cells (blue), each characterized by antenna configuration, bandwidth, and frequency. (b) Abstracted graph model in which the UE’s serving cell (central node, $h_{0}$) connects to neighboring cell nodes $h_{1}$ through $h_{8}$. Edge thickness and node ordering represent measurement strength (e.g., RSRP), capturing inter-cell interference and handover affinities relevant to downstream machine learning tasks.
  • Figure 2: Overview of a scalable architecture for a distributed RL system integrated with wireless network simulators. A GPU-based learner updates neural network weights using batches sampled from a distributed replay memory containing experience trajectories. Updated weights are periodically shared with a set of CPU-based actors, which concurrently interact with diverse simulation environments, emulating variable channel conditions and user behaviors. The resulting experience streams are pushed to the replay memory, with sampling priorities updated based on the learner feedback.
  • Figure 3: An AI architecture integrating distributed learning in a 5G RAN system, with the learning engine realized either as a standardized function within the network automation layer or as a standalone support function. A centralized learning engine performs model training, data management, and model lifecycle management, while distributed RAN network functions and r/xApps host inference (actor) functions that generate training data from local environments. Standardized and proprietary interfaces support the exchange of raw and processed data, control information, and model artifacts between the learning engine, SMO/OAM components, and RAN functions, enabling deployment across different layers of the RAN architecture.
  • Figure 4: Markov decision process formulation for downlink link adaptation and HARQ in wireless communication. The figure illustrates how episodes in an MDP are structured to model the interaction between the BS and UE during downlink transmission under a HARQ protocol. Each episode corresponds to the transmission of a single packet and comprises a sequence of steps, each defined by a state ($s_t$), an action ($a_t$), and a reward ($r_t$). The process begins with the arrival of a new packet at the BS (shown in dark blue) and proceeds through repeated transmissions, each followed by either a NACK or an ACK from the UE. Failed transmissions are marked in red, and successful transmissions in green. The corresponding NACK and ACK signals are indicated in light red and light green, respectively. Light blue segments represent retransmissions of the same packet due to decoding failures, repeated until successful reception is confirmed via an ACK.
  • Figure 5: Per-actor replay-memory ingestion rates during large-scale distributed training. Each curve shows the number of 500-sample batches written per minute by one of four representative actors (IDs 0, 10, 20, and 30) to its assigned shard over a 30-hour run. After an initial ramp-up and shard initialization jitter (14 May, 17:00–19:00 UTC), all actors converge to a steady rate of 18–20 batches/minute. The uniformity of these curves confirms consistent ingestion throughput and balanced load across the actor population.
  • ...and 7 more figures
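
The episode structure described in the Figure 4 caption (one packet per episode, one step per transmission attempt, ending on the first ACK) can be sketched as below. This is an illustrative sketch only: the function and field names, the retransmission cap `max_tx`, the contents of the state, and the reward rule are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    state: Dict[str, int]   # s_t: here only the attempt index (assumed)
    action: int             # a_t: e.g. an MCS index chosen by the policy
    reward: float           # r_t: placeholder reward, not the paper's
    ack: bool               # True if the UE decoded this transmission


def run_episode(
    policy: Callable[[Dict[str, int]], int],
    decode_success: Callable[[int, int], bool],
    max_tx: int = 4,        # assumed cap on attempts; caption states no limit
) -> List[Step]:
    """Simulate one packet: transmit, then retransmit on NACK until ACK."""
    steps: List[Step] = []
    for tx_index in range(max_tx):
        state = {"tx_index": tx_index}
        action = policy(state)
        ack = decode_success(action, tx_index)
        # Hypothetical reward: pay out the chosen action's value on success.
        reward = float(action) if ack else 0.0
        steps.append(Step(state, action, reward, ack))
        if ack:
            break           # the ACK closes the episode for this packet
    return steps


# Usage: a fixed policy and a channel that decodes on the third attempt.
episode = run_episode(
    policy=lambda state: 5,
    decode_success=lambda action, tx_index: tx_index >= 2,
)
```

Here `episode` holds two NACKed steps followed by one ACKed step, mirroring the red, red, green sequence the caption describes.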