Meta-Learning Multi-armed Bandits for Beam Tracking in 5G and 6G Networks

Alexander Mattick, George Yammine, Georgios Kontes, Setareh Maghsudi, Christopher Mutschler

TL;DR

This work tackles beam management in mmWave networks by formulating beam selection as a partially observable restless multi-armed bandit problem. It introduces a meta-learning approach that amortizes posterior inference via stochastic variational inference, decomposing the problem into a bandit-search head and a goal-predictor head to handle UE movement and environmental dynamics. During deployment, a fast online inference step mirrors Thompson sampling, enabling real-time beam decisions using RSS feedback only. Empirical results show robust generalization across trajectories, environments, and codebook sizes, outperforming state-of-the-art baselines and offering substantial reductions in probe requirements. The approach provides a scalable framework for online RMAB inference in dynamic wireless settings with practical relevance to 5G/6G beam management.

Abstract

Beamforming-capable antenna arrays with many elements enable higher data rates in next-generation 5G and 6G networks. In current practice, analog beamforming uses a codebook of pre-configured beams, each radiating in a specific direction, and a beam management function continuously selects optimal beams for moving user equipments (UEs). However, large codebooks and effects caused by reflections or blockages of beams make optimal beam selection challenging. In contrast to previous work and standardization efforts, which train supervised classifiers to predict the next best beam from previously selected beams, we formulate the problem as a partially observable Markov decision process (POMDP) and model the environment as the codebook itself. At each time step, we select a candidate beam conditioned on the belief state of the unobservable optimal beam and on previously probed beams. This frames beam selection as an online search procedure that locates the moving optimal beam. Unlike previous work, our method handles new or unforeseen trajectories and changes in the physical environment, and outperforms previous work by orders of magnitude.
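
To make the idea of selecting beams from a belief state using RSS feedback alone more concrete, here is a minimal Python sketch of discounted Gaussian Thompson sampling over a codebook. It is a generic heuristic for non-stationary bandits, not the meta-learned method of the paper. The function name `thompson_beam_tracking`, the callback `rss_fn`, and all parameters (`discount`, `noise_var`, `prior_var`) are assumptions made for illustration.

```python
import numpy as np


def thompson_beam_tracking(rss_fn, n_beams, n_steps, discount=0.9,
                           noise_var=1.0, prior_var=100.0, seed=0):
    """Illustrative discounted Gaussian Thompson sampling over beam IDs.

    rss_fn(t, beam) is a hypothetical callback returning the measured RSS
    of the probed beam at step t. Only this scalar feedback is used.
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_beams)  # discounted number of probes per beam
    sums = np.zeros(n_beams)    # discounted sum of observed RSS per beam
    chosen = []
    for t in range(n_steps):
        # Gaussian belief per beam: zero-mean prior, known noise variance.
        precision = 1.0 / prior_var + counts / noise_var
        mean = (sums / noise_var) / precision
        std = np.sqrt(1.0 / precision)
        # Thompson step: draw one sample per beam, probe the best sample.
        beam = int(np.argmax(rng.normal(mean, std)))
        rss = rss_fn(t, beam)
        # Forget old evidence so the belief can follow a moving optimum.
        counts *= discount
        sums *= discount
        counts[beam] += 1.0
        sums[beam] += rss
        chosen.append(beam)
    return chosen


if __name__ == "__main__":
    # Toy environment (assumption): beam 3 is always the strongest.
    picks = thompson_beam_tracking(
        lambda t, b: 10.0 if b == 3 else 0.0, n_beams=8, n_steps=300)
    print(picks[-10:])
```

Discounting the sufficient statistics widens the belief over beams that have not been probed recently, so they are occasionally re-probed. This is one simple way to handle the restless setting in which the optimal beam moves with the UE.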

Paper Structure

This paper contains 22 sections, 9 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: Schematic view of our approach: an agent (carrying a UE) moves through the environment. The line-of-sight (LOS) path between the UE and the base station might be blocked, and signals might be received via reflectors. During UE tracking, we select beam IDs from the codebook (orange), while the optimal beams are shown in green (darker colors encode later timestamps). The right panel illustrates the learning process based on available trajectories: the left side depicts the variational distribution, and the right side shows the model distribution. The upper half models the unobservable goal state of the problem, while the lower half represents the exploration state.
  • Figure 2: Visualization of the simulation environment: the UE (black cross) moves through the environment along the dashed path. At each step, the link between the BSs (black square) and the UE is checked for obstructions. Obstructed links are plotted in red, unobstructed links in blue.
  • Figure 3:
  • Figure 4: Total RSS of PPO vs. our approach.
  • Figure 5: Uncertainty of beam quality over time as measured using our XGBoost model.