Multi-Timescale Ensemble Q-learning for Markov Decision Process Policy Optimization

Talha Bozkus, Urbashi Mitra

TL;DR

This work tackles policy optimization in large, unknown MDPs typical of network control by introducing nEQL, a model-free ensemble Q-learning framework that runs multiple Q-learning processes on synthetically generated Markovian environments spanning different time scales and fuses their outputs via Jensen–Shannon divergence-based weights. The authors provide theoretical guarantees, including unbiasedness of the ensemble Q-values in the limit and variance bounds that shrink with more environments, supporting convergence to the original optimal policy in the mean-square sense. Empirically, nEQL achieves up to 55% lower average policy error and up to 50% faster runtimes across diverse network models (randomized graphs, cliff-walking, SISO, and MISO wireless) compared to strong Q-learning baselines, validating both scalability and performance gains. The approach offers a practical, scalable pathway to leverage multi-scale environment structure for efficient learning in large discrete-state MDPs.
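
The step from unbiasedness plus a variance bound to a mean-square statement rests on the standard bias-variance decomposition. Writing $\mathcal{E}_t(s,a)$ for the ensemble $Q$-function error (the symbol used in Proposition 1), the identity below holds for any random error with finite second moment. It is a general fact, shown here only to make the reasoning explicit, and is not the paper's specific bound:

$$
\mathbb{E}\big[\mathcal{E}_t(s,a)^2\big] = \mathrm{Var}\big[\mathcal{E}_t(s,a)\big] + \big(\mathbb{E}[\mathcal{E}_t(s,a)]\big)^2
$$

Once the bias term $\mathbb{E}[\mathcal{E}_t(s,a)]$ vanishes in the limit, the limiting mean-square error is controlled by the variance term alone, which is why a variance bound that shrinks with the number of environments matters.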

Abstract

Reinforcement learning (RL) is a classical tool for solving network control or policy optimization problems in unknown environments. The original Q-learning suffers from performance and complexity challenges across very large networks. Herein, a novel model-free ensemble reinforcement learning algorithm that adapts classical Q-learning is proposed to handle these challenges for networks that admit Markov decision process (MDP) models. Multiple Q-learning algorithms are run in parallel on multiple, distinct, synthetically created and structurally related Markovian environments; the outputs are fused using an adaptive weighting mechanism based on the Jensen-Shannon divergence (JSD) to obtain an approximately optimal policy with low complexity. The theoretical justification of the algorithm, including the convergence of key statistics and Q-functions, is provided. Numerical results across several network models show that the proposed algorithm can achieve up to 55% lower average policy error with up to 50% less runtime complexity than state-of-the-art Q-learning algorithms. The numerical results also validate the assumptions made in the theoretical analysis.
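
To make the fusion idea concrete, the following is a minimal sketch of a JSD-based weighting of several Q-tables. It is not the authors' algorithm: the abstract does not specify how Q-values are turned into distributions or how divergences become weights, so the softmax normalization, the choice of the first table as reference, the `exp(-JSD)` weighting, and the function names `softmax`, `jsd`, and `fuse_q_tables` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log((p + eps) / (m + eps)))
    kl_qm = np.sum(q * np.log((q + eps) / (m + eps)))
    return 0.5 * (kl_pm + kl_qm)


def fuse_q_tables(q_tables, reference=0):
    """Fuse Q-tables of identical shape (states x actions).

    Assumption: each table is mapped to a distribution with a softmax over
    its flattened entries, and tables whose distribution is closer (in JSD)
    to the reference table receive larger weights.
    """
    ref = softmax(q_tables[reference].ravel())
    divs = np.array([jsd(ref, softmax(q.ravel())) for q in q_tables])
    weights = softmax(-divs)  # smaller divergence -> larger weight (assumed rule)
    fused = np.tensordot(weights, np.stack(q_tables), axes=1)
    return fused, weights


if __name__ == "__main__":
    # Hypothetical Q-tables from three environments (3 states, 2 actions).
    base = np.arange(6, dtype=float).reshape(3, 2)
    tables = [base, base + 0.1, -base]
    fused, weights = fuse_q_tables(tables)
    # Greedy policy; argmax assumes Q-values are rewards (use argmin for costs).
    policy = fused.argmax(axis=1)
    print(weights, policy)
```

In this sketch the weights always sum to one and the reference table never receives less weight than any other table, since its divergence from itself is zero. An adaptive scheme, as described in the abstract, would recompute the weights as the Q-tables are updated during learning.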

Paper Structure

This paper contains 27 sections, 7 theorems, 34 equations, 9 figures, 2 tables, 2 algorithms.

Key Result

Proposition 1

Let $u_t$ be a constant: $u_t=u$. Under the distribution assumption (Equ: distribution_assumption), the algorithm (Algorithm: ensemble_link_learning) produces unbiased $Q$-functions in the limit, i.e., $\lim\limits_{t\rightarrow \infty}\mathbb{E}[\mathcal{E}_t(s,a)] = 0$. If the $Q$-function errors of a given envi...

Figures (9)

  • Figure 1: The relationship between $\mathcal{M}^{(1)}$ and $\mathcal{M}^{(n)}$.
  • Figure 2: Classification of Q-Learning (QL) algorithms based on their strategies and implementation.
  • Figure 3: Examples of wireless network models.
  • Figure 4: APE performances across different environments.
  • Figure 5: APE results across different network models.
  • ...and 4 more figures

Theorems & Definitions (7)

  • Proposition 1
  • Corollary 1
  • Corollary 2
  • Corollary 3
  • Proposition 2
  • Proposition 3
  • Proposition 4