Offline Reinforcement Learning and Sequence Modeling for Downlink Link Adaptation

Samuele Peri, Alessio Russo, Gabor Fodor, Pablo Soldati

TL;DR

The paper addresses downlink link adaptation (LA) in 5G/6G radio access networks (RANs) by proposing offline reinforcement learning (RL) as a non-invasive alternative to training in live networks. It introduces three LA designs, based on batch-constrained deep Q-learning (BCQ), conservative Q-learning (CQL), and a decision transformer (DT), trained on static transition datasets gathered with a deep Q-network (DQN) behavioral policy, and shows that offline RL can match online RL performance when data quality and coverage are appropriate. Across simulations, BCQ and CQL typically outperform the outer-loop link adaptation (OLLA) baseline and approach the performance of online DQN, while DT offers long-horizon sequence modeling with careful return-to-go (RTG) conditioning and temporal embeddings. The work demonstrates the practical viability of offline RL for RAN control, and discusses DT design considerations, data collection strategies, and avenues for improving generalization to large-scale deployments and non-invasive data collection.

Abstract

Link adaptation (LA) is an essential function in modern wireless communication systems that dynamically adjusts the transmission rate of a communication link to match time- and frequency-varying radio link conditions. However, factors such as user mobility, fast fading, imperfect channel quality information, and aging of measurements make the modeling of LA challenging. To bypass the need for explicit modeling, recent research has introduced online reinforcement learning (RL) approaches as an alternative to the more commonly used rule-based algorithms. Yet, RL-based approaches face deployment challenges, as training in live networks can potentially degrade real-time performance. To address this challenge, this paper considers offline RL as a candidate to learn LA policies with minimal effects on the network operation. We propose three LA designs based on batch-constrained deep Q-learning, conservative Q-learning, and decision transformer. Our results show that offline RL algorithms can match the performance of state-of-the-art online RL methods when data is collected with a proper behavioral policy.
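As a point of reference for the decision transformer design, the return-to-go (RTG) that such a model is conditioned on is, in the standard formulation, the sum of the rewards from the current step to the end of the episode. The sketch below computes it for one episode; the function name `returns_to_go` and the reward values are illustrative assumptions, and the paper's own RTG conditioning may differ from this undiscounted form.

```python
import numpy as np

def returns_to_go(rewards):
    """Undiscounted return-to-go: rtg[t] = rewards[t] + rewards[t+1] + ... + rewards[T-1].

    Assumption: this is the standard decision-transformer definition,
    not necessarily the exact variant used in the paper.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Reverse, take the running sum, and reverse back to get suffix sums.
    return np.cumsum(rewards[::-1])[::-1]

# Hypothetical per-step rewards for a four-step episode.
episode_rewards = [1.0, 0.0, 2.0, 1.0]
rtg = returns_to_go(episode_rewards)
print(rtg)
```

At inference time a decision transformer is typically prompted with a target RTG, which is then decremented by each observed reward.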

Paper Structure

This paper contains 28 sections, 5 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Episodes in our MDP formulation for LA. Each box represents a transmission time interval (TTI), and acknowledgments use the same color as the transmission they respond to. Note that episodes 1 and 2 overlap between TTIs 5 and 12.
  • Figure 2: At training time, the proposed attention mask (right matrix) prevents the tokens associated with $p_2$ and $p_4$ from attending to $p_1$, because the reward of $p_1$ is not yet available when an MCS index is requested for $p_2$ and $p_4$. The left matrix shows the original attention mask definition.
  • Figure 3: Cumulative distribution function (CDF) of user throughput, block error rate (BLER), and spectral efficiency (SE) for various offline RL models trained on $\mathcal{D}_{\rm opt}$. All offline RL algorithms achieve similar performance across all metrics, comparable with the online DQN policy, and outperform the OLLA baseline.
  • Figure 4: Custom LA environment.
  • Figure 5: Q-values at $s_0$.
  • ...and 1 more figure
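The idea behind the attention mask in the Figure 2 caption can be sketched as a causal mask that additionally hides any earlier packet whose feedback has not yet arrived. The code below is a minimal illustration under stated assumptions, not the paper's implementation: it uses one token per packet, the function name `delay_aware_mask` is hypothetical, and the request and acknowledgment TTIs are invented values that do not reproduce the figure.

```python
import numpy as np

def delay_aware_mask(request_tti, ack_tti):
    """Return a boolean matrix where mask[i, j] is True if token i may attend to token j.

    Assumptions (illustrative only): one token per packet, packets ordered by
    request time, and a past packet j is visible to packet i only if its
    acknowledgment arrived no later than the TTI at which packet i's MCS was requested.
    """
    request_tti = np.asarray(request_tti)
    ack_tti = np.asarray(ack_tti)
    n = len(request_tti)
    idx = np.arange(n)
    # Standard causal (lower-triangular) mask: no attention to future tokens.
    causal = idx[:, None] >= idx[None, :]
    # Feedback of packet j must already be known when packet i is scheduled.
    ack_seen = ack_tti[None, :] <= request_tti[:, None]
    # A token can always attend to itself.
    self_attn = np.eye(n, dtype=bool)
    return causal & (ack_seen | self_attn)

# Hypothetical timings: TTI of each MCS request and of each acknowledgment.
request_tti = [0, 2, 5, 6]
ack_tti = [4, 6, 9, 10]
mask = delay_aware_mask(request_tti, ack_tti)
print(mask.astype(int))
```

With these timings the second packet cannot attend to the first, since its request at TTI 2 precedes the first acknowledgment at TTI 4, whereas a plain causal mask would allow it. Attention libraries differ on whether `True` in a boolean mask means "allowed" or "blocked", so the matrix may need to be inverted before use.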