Deep Contextual Bandit and Reinforcement Learning for IRS-Assisted MU-MIMO Systems

Dariel Pereira-Ruisánchez; Óscar Fresnedo; Darian Pérez-Adán; Luis Castedo

TL;DR

This work tackles the problem of jointly optimizing the IRS phase-shift matrix and MIMO precoders in an IRS-assisted MU-MIMO uplink to maximize the sum-rate $R_{\text{sum}}$. It introduces two learning-based frameworks: a contextual bandit approach with continuous state/action spaces (DCB-DDPG) and a DRL formulation aligned with an MDP (DRL-DDPG), each paired with specialized ANN architectures. The paper details the state/action/reward definitions, framework elements, training algorithms, and network structures, and provides a comparative analysis of convergence properties and computational complexity. Simulation results show that both approaches outperform heuristic baselines under strong multi-user interference, with DCB-DDPG offering superior stability and lower complexity, while DRL-DDPG demonstrates effective long-horizon learning. The findings suggest that continuous-valued formulations are advantageous for IRS-MIMO optimization and remain practical when discretization is later applied.

Abstract

The combination of multiple-input multiple-output (MIMO) systems and intelligent reflecting surfaces (IRSs) is foreseen as a critical enabler of beyond 5G (B5G) and 6G. In this work, two different approaches are considered for the joint optimization of the IRS phase-shift matrix and MIMO precoders of an IRS-assisted multi-stream (MS) multi-user MIMO (MU-MIMO) system. Both approaches aim to maximize the system sum-rate for every channel realization. The first proposed solution is a novel contextual bandit (CB) framework with continuous state and action spaces called deep contextual bandit-oriented deep deterministic policy gradient (DCB-DDPG). The second is an innovative deep reinforcement learning (DRL) formulation where the states, actions, and rewards are selected such that the Markov decision process (MDP) property of reinforcement learning (RL) is appropriately met. Both proposals perform remarkably better than state-of-the-art heuristic methods in scenarios with high multi-user interference.
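
As a rough illustration of the joint optimization described above, the problem can be sketched as follows. The notation is assumed here for illustration and is not taken from the paper: $K$ users, direct channels $\mathbf{H}_{\mathrm{d},k}$, user-to-IRS channels $\mathbf{H}_{\mathrm{r},k}$, IRS-to-base-station channel $\mathbf{G}$, an IRS with $N$ elements, per-user power budget $P_{\max}$, and noise power $\sigma^2$. The rate expression is the standard multiple-access sum-rate under joint decoding, which may differ from the expression the authors use.

$$
\begin{aligned}
\max_{\boldsymbol{\Theta},\,\{\mathbf{P}_k\}} \quad & R_{\text{sum}} = \log_2 \det\!\Big(\mathbf{I} + \tfrac{1}{\sigma^2}\sum_{k=1}^{K} \mathbf{H}_k \mathbf{P}_k \mathbf{P}_k^{H} \mathbf{H}_k^{H}\Big), \qquad \mathbf{H}_k = \mathbf{H}_{\mathrm{d},k} + \mathbf{G}\,\boldsymbol{\Theta}\,\mathbf{H}_{\mathrm{r},k} \\
\text{s.t.} \quad & \boldsymbol{\Theta} = \operatorname{diag}\big(e^{j\theta_1},\dots,e^{j\theta_N}\big), \quad \theta_n \in [0, 2\pi), \\
& \operatorname{tr}\big(\mathbf{P}_k \mathbf{P}_k^{H}\big) \le P_{\max}, \quad k = 1,\dots,K.
\end{aligned}
$$

Under these assumptions, both the phase shifts $\theta_n$ and the precoder entries of $\mathbf{P}_k$ are continuous variables, which is consistent with the continuous state and action spaces that the two proposed frameworks rely on.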

Paper Structure (24 sections, 11 equations, 15 figures, 3 tables, 2 algorithms)

Introduction
Preliminaries
Reinforcement Learning (RL) and Contextual Bandit (CB)
Related Works
Overcoming the Limitations
System Model and Optimization Problem
Notation
IRS-assisted MS MU-MIMO Uplink
Channel Model
CB-based Joint Optimization
DCB-DDPG: State, Action and Reward
DCB-DDPG: Framework Elements
DCB-DDPG: Proposed Algorithm
DCB-DDPG: ANN Structure
RL-based Joint Optimization
...and 9 more sections

Figures (15)

Figure 1: Contextual bandit (CB) framework.
Figure 2: Reinforcement learning (RL) framework.
Figure 3: Uplink of an IRS-assisted MS MU-MIMO system.
Figure 4: Elements in the DCB-DDPG agent.
Figure 5: DCB-DDPG actor and critic network structure.
...and 10 more figures
