Continual Model-based Reinforcement Learning for Data Efficient Wireless Network Optimisation

Cengis Hasan, Alexandros Agapitos, David Lynch, Alberto Castagna, Giorgio Cruciata, Hao Wang, Aleksandar Milenovic

TL;DR

The paper tackles the long lead-time of deploying cell-level parameter optimisation to new wireless network sites by formulating stage-wise tuning of parameter subsets as continual model-based reinforcement learning over expanding action spaces. It combines a probabilistic reward ensemble, autoencoder-based state compression, and the Progress & Compress framework to achieve data-efficient transfer across tasks while avoiding catastrophic forgetting. Empirical results show a two-fold reduction in deployment lead-time, up to a 4% throughput gain, and substantial reductions in data requirements, memory, and training time, with 80 ms inference for large-scale networks. The work demonstrates practical gains in adapting to diverse sites and traffic conditions, and outlines future directions in causal structure learning and sim-to-real policy warm-starting to further improve efficiency and robustness.

Abstract

We present a method that addresses the long lead-time required to deploy cell-level parameter optimisation policies to new wireless network sites. Given a sequence of action spaces represented by overlapping subsets of cell-level configuration parameters provided by domain experts, we formulate throughput optimisation as Continual Reinforcement Learning of control policies. Simulation results suggest that the proposed system shortens the end-to-end deployment lead-time two-fold compared to a reinitialise-and-retrain baseline, without any drop in optimisation gain.

Paper Structure

This paper contains 20 sections, 11 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: The high-dimensional raw network state $s^{raw}_t \in \mathbb{R}^{410}$ at hour $t$ is compressed by a modular auto-encoder to $s_t \in \mathbb{R}^{50}$. The reward model is an ensemble of probabilistic neural networks; a single component model $m$ is displayed.
  • Figure 2: Analysis of the reward model.
  • Figure 3: Training policies using the Progress & Compress framework. The two hidden layers, called shared and branch, are dense layers with $128$ and $64$ neurons, respectively, both with $\mathrm{tanh}$ non-linearity.
  • Figure 4: Learning curves. Blue shading shows standard deviation over 20 independent runs.
  • Figure 5: Assessment of TP gain.
  • ...and 1 more figure
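
The dimensions quoted in the captions of Figures 1 and 3 can be made concrete with a small shape-only sketch. It is not the authors' implementation: the weights are random, the modular auto-encoder is replaced by a single linear map, and the assumption that the policy consumes the compressed state $s_t$ is ours. The names `dense`, `encode`, `policy_logits` and the output size `N_ACTIONS` are hypothetical and do not come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

RAW_DIM, LATENT_DIM = 410, 50          # from the Figure 1 caption
SHARED_UNITS, BRANCH_UNITS = 128, 64   # from the Figure 3 caption
N_ACTIONS = 4                          # hypothetical; not given in the captions


def dense(in_dim, out_dim):
    """Randomly initialised weights and zero bias for one dense layer."""
    return rng.normal(0.0, 0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


def encode(s_raw, params):
    """Stand-in for the encoder: one linear map from R^410 to R^50."""
    w, b = params
    return s_raw @ w + b


def policy_logits(s, shared, branch, head):
    """Shared layer (128, tanh) -> branch layer (64, tanh) -> linear head."""
    h = np.tanh(s @ shared[0] + shared[1])
    h = np.tanh(h @ branch[0] + branch[1])
    return h @ head[0] + head[1]


encoder = dense(RAW_DIM, LATENT_DIM)
shared = dense(LATENT_DIM, SHARED_UNITS)
branch = dense(SHARED_UNITS, BRANCH_UNITS)
head = dense(BRANCH_UNITS, N_ACTIONS)

s_raw = rng.normal(size=(8, RAW_DIM))   # batch of 8 synthetic raw states
s = encode(s_raw, encoder)              # compressed states, shape (8, 50)
logits = policy_logits(s, shared, branch, head)
print(s.shape, logits.shape)            # (8, 50) (8, 4)
```

The sketch only checks that the layer sizes compose; training, the probabilistic reward ensemble, and the Progress & Compress procedure are outside its scope.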