Average Reward Reinforcement Learning for Wireless Radio Resource Management

Kun Yang; Jing Yang; Cong Shen

TL;DR

This work identifies a fundamental mismatch between discounted reward reinforcement learning and the long-term objectives of radio resource management (RRM) in wireless networks. It reframes RRM as an average-reward optimization problem and develops the Average Reward Off-policy Soft Actor-Critic (ARO-SAC), an extension of SAC that optimizes the average reward by estimating the average reward rate $\rho$ and using a differential return. Empirical results on a RAN slicing scenario show that ARO-SAC can outperform standard discounted RL approaches by approximately 15%, while avoiding the instability associated with setting $\gamma=1$ for long horizons. The findings demonstrate the practical value of average-reward RL for wireless network optimization and point to future work on theoretical guarantees and hyperparameter tuning for ARO-SAC.
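The quantities named above can be illustrated with a minimal Python sketch. It is not the ARO-SAC algorithm: the reward sequences are made up, the helper names are hypothetical, and the running-average update for $\rho$ is a generic exponential moving average assumed here for illustration, since the summary does not specify the estimator that ARO-SAC uses.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def average_reward(rewards):
    """Empirical average reward per step (a finite-sample stand-in for rho)."""
    return sum(rewards) / len(rewards)


def differential_return(rewards, rho):
    """Sum of (r_t - rho): reward accumulated relative to the average rate."""
    return sum(r - rho for r in rewards)


def update_rho(rho, reward, step_size):
    """Assumed estimator: move rho a small step toward the latest reward."""
    return rho + step_size * (reward - rho)


# Two made-up reward sequences of equal length (illustrative only).
early = [5.0, 5.0] + [0.0] * 8   # large reward up front, nothing afterwards
steady = [0.0, 0.0] + [2.0] * 8  # slow start, then sustained reward

# A small discount factor prefers `early`; the average reward prefers `steady`.
print(discounted_return(early, 0.5), discounted_return(steady, 0.5))
print(average_reward(early), average_reward(steady))

# Measured against its own average, a sequence's differential return is ~0.
print(differential_return(early, average_reward(early)))
```

The two sequences rank differently under the two criteria, which is the kind of mismatch the summary describes: a discounted objective can favor behavior that is worse in terms of long-run throughput.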

Abstract

In this paper, we address a crucial but often overlooked issue in applying reinforcement learning (RL) to radio resource management (RRM) in wireless communications: the mismatch between the discounted reward RL formulation and the undiscounted goal of wireless network optimization. To the best of our knowledge, we are the first to systematically investigate this discrepancy, starting with a discussion of the problem formulation followed by simulations that quantify the extent of the gap. To bridge this gap, we introduce the use of average reward RL, a method that aligns more closely with the long-term objectives of RRM. We propose a new method, the Average Reward Off-policy Soft Actor-Critic (ARO-SAC), which adapts the well-known Soft Actor-Critic algorithm to the average reward framework. This new method achieves a significant performance improvement: our simulation results demonstrate a 15% gain in system performance over the traditional discounted reward RL approach, underscoring the potential of average reward RL in enhancing the efficiency and effectiveness of wireless network optimization.
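For reference, the two objectives contrasted in the abstract can be written in their standard textbook forms. The notation below ($\pi$ for the policy, $r_t$ for the reward at step $t$) is assumed here and may differ from the symbols used in the paper's own equations.

$$J_\gamma(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t\right], \qquad \rho(\pi) = \lim_{T\to\infty} \frac{1}{T}\,\mathbb{E}_\pi\left[\sum_{t=0}^{T-1} r_t\right]$$

With $\gamma < 1$, the first objective weights early rewards more heavily than later ones, whereas the second weights every time step equally, which is closer to a long-run throughput target.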

Paper Structure (12 sections, 11 equations, 3 figures, 3 tables, 1 algorithm)

Introduction
Related Works
Problem Formulation
RAN Slicing
From RRM to Discounted Reward RL
Detailed Environment Setting
The impact of discount factor and horizon
Average Reward Soft Actor-Critic
Re-formulation
Average Reward SAC
Experiments
Conclusion

Figures (3)

Figure 1: Illustration of a RAN slicing system
Figure 2: Experimental results with $\gamma = 1$. Shadowed areas indicate the confidence intervals.
Figure 3: Experimental results using ARO-SAC, averaged over 5 independent runs across 5 different combinations of user numbers.
