Benchmarking Quantum Reinforcement Learning

Nico Meyer, Christian Ufrecht, George Yammine, Georgios Kontes, Christopher Mutschler, Daniel D. Scherer

TL;DR

The paper tackles the lack of standardized benchmarking in reinforcement learning and quantum reinforcement learning by introducing a statistically grounded sample-complexity estimator and a rigorously defined outperformance criterion. It pairs this methodology with a flexible BeamManagement6G benchmark to enable scalable, reproducible comparisons between classical and hybrid quantum agents, evaluating both off-policy DDQN and on-policy PPO. Across extensive, 100-seed experiments, hybrid quantum models demonstrate competitive sample efficiency and occasional advantages over similarly sized classical networks, though no definitive quantum advantage emerges without scaling to larger, hardware-capable qubit counts. The work emphasizes the empirical nature of quantum advantage, calls for larger-scale studies, and provides open-source tools to promote rigorous, reproducible benchmarking in QRL.

Abstract

Benchmarking and establishing proper statistical validation metrics for reinforcement learning (RL) remain ongoing challenges, where no consensus has been established yet. The emergence of quantum computing and its potential applications in quantum reinforcement learning (QRL) further complicate benchmarking efforts. To enable valid performance comparisons and to streamline current research in this area, we propose a novel benchmarking methodology, which is based on a statistical estimator for sample complexity and a definition of statistical outperformance. Furthermore, considering QRL, our methodology casts doubt on some previous claims regarding its superiority. We conducted experiments on a novel benchmarking environment with flexible levels of complexity. While we still identify possible advantages, our findings are more nuanced overall. We discuss the potential limitations of these results and explore their implications for empirical research on quantum advantage in QRL.

Paper Structure

This paper contains 24 sections, 2 theorems, 33 equations, 20 figures, 3 tables, 2 algorithms.

Key Result

Theorem 2.2

$\hat{S}$ is consistent, that is, for all $\eta > 0$ we find that $\lim_{N \to \infty} P(|\hat{S} - S| > \eta) = 0$.
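
Consistency can be illustrated numerically with a minimal sketch. The estimator below is an assumption for illustration, not the paper's Definition 4.1: it takes $\hat{S}$ to be the empirical $\delta$-quantile of the per-run step counts at which a training run first surpasses the performance threshold, and it takes $N$ to be the number of runs. The exponential hitting-time model, the `scale` parameter, and all function names are hypothetical stand-ins chosen so that the true value $S$ is known in closed form.

```python
import math
import random


def empirical_sample_complexity(hitting_times, delta):
    """Smallest step count s such that at least a fraction delta of the
    runs crossed the threshold within s steps (empirical delta-quantile).
    Runs that never cross the threshold can be encoded as math.inf."""
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must be in (0, 1]")
    ordered = sorted(hitting_times)
    k = math.ceil(delta * len(ordered))  # number of runs that must succeed
    return ordered[k - 1]


def simulate_hitting_times(n_runs, rng, scale=1000.0):
    """Hypothetical stand-in for real training runs: hitting times drawn
    from an exponential distribution with mean `scale`."""
    return [rng.expovariate(1.0 / scale) for _ in range(n_runs)]


def true_sample_complexity(delta, scale=1000.0):
    """Exact delta-quantile of the exponential model used above."""
    return -scale * math.log(1.0 - delta)


if __name__ == "__main__":
    rng = random.Random(0)
    delta = 0.8
    s_true = true_sample_complexity(delta)
    for n_runs in (10, 100, 1000, 10000):
        s_hat = empirical_sample_complexity(
            simulate_hitting_times(n_runs, rng), delta
        )
        print(n_runs, round(s_hat, 1), round(abs(s_hat - s_true), 1))
```

Under these assumptions, the printed deviation $|\hat{S} - S|$ tends to shrink as the number of runs grows, which is the behavior the theorem describes. With only a handful of runs the estimate can be far off.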

Figures (20)

  • Figure 1: Comparison of empirical sample complexities $\hat{S}$ of double deep Q-learning and a quantum version of the algorithm (lower is better). Sample complexity is the number of environment-agent interactions needed to surpass a performance threshold $1-\varepsilon$ with probability $\delta$. The figure shows the result for the BeamManagement6G environment introduced in this work. In order of decreasing sample complexity: a small classical neural network with $2$ hidden layers of width $16$, i.e., $387$ parameters; a small quantum circuit with $4$ layers on $14$ qubits, i.e., $437$ variational parameters, integrated between fully connected classical layers with an additional $101$ parameters; a large classical neural network with $2$ hidden layers of width $64$, i.e., $4611$ parameters. The hybrid quantum model consistently outperforms the similar-sized classical network and is also competitive with the $10$-fold larger classical model.
  • Figure 2: Inadequate reporting of two (Q)RL agents' performance can lead to false conclusions about sample complexity. Although the curves may seem exaggerated, it is common practice in QRL studies to benchmark with such a limited number of runs.
  • Figure 3: Hybrid classical-quantum neural network for an exemplary $4$-qubit quantum layer. The dimensionality of the observation is mapped to the number of qubits in the quantum circuit using a fully connected layer. The VQC (for details on the ansatz, see the ansatz figure in the model appendix) acts as a hidden layer, and all qubits are measured individually in the Pauli-Z basis. These measurement results are post-processed using another fully connected layer, mapping to the number of actions.
  • Figure 4: Two example learning curves generated by two different algorithms or algorithmic settings (algorithm 1 and algorithm 2). While algorithm 2 exhibits lower sample complexity than algorithm 1 with respect to threshold $V_2^*$ (for this particular training run), the converse is true for threshold $V_1^*$, which algorithm 2 may never even reach. Consequently, if convergence to optimality cannot be proved for the algorithm, sample complexity is well defined only with respect to a given threshold.
  • Figure 5: The BeamManagement6G environment consists of a set of antennas $A \in$ Antenna, of which only one is active at any point in time. Furthermore, each antenna is equipped with multiple beams, also referred to as codebook elements $B \in$ Codebook, which are selected automatically. A user moves through the environment, is targeted by one of the antennas, and receives some intensity $I \in$ Intensity. Based on this observation, i.e., the active antenna $A_{t-1}$, beam $B_{t-1}$, and intensity $I_{t-1}$ at the previous time step $t-1$, the task is to select the optimal antenna for the next time step $t$, i.e., the $A_t$ providing the greatest intensity value $I_t$ to the user. The objective is to maximize the sum of received intensities over the entire trajectory. Note that the spatial position of the user is unknown, as localization induces unreasonable real-world overhead and furthermore conflicts with user privacy concerns.
  • ...and 15 more figures
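
The parameter counts quoted in the Figure 1 caption can be cross-checked with a short arithmetic sketch. It assumes a 3-dimensional observation (antenna, beam, and intensity, as described for Figure 5) and 3 available actions. Neither dimension is stated in the captions above; both are assumptions chosen because they reproduce the reported numbers under standard fully connected layers with biases. The $437$ variational parameters of the quantum circuit depend on the ansatz and are not derived here.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def mlp_params(layer_sizes):
    """Total parameters of a fully connected network with the given widths."""
    return sum(dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


OBS_DIM = 3    # assumption: (antenna, beam, intensity)
N_ACTIONS = 3  # assumption: number of selectable antennas
N_QUBITS = 14  # from the Figure 1 caption

small_classical = mlp_params([OBS_DIM, 16, 16, N_ACTIONS])
large_classical = mlp_params([OBS_DIM, 64, 64, N_ACTIONS])
# Classical layers wrapped around the quantum circuit (Figure 3):
# observation -> qubits, then measurements -> actions.
hybrid_classical = dense_params(OBS_DIM, N_QUBITS) + dense_params(N_QUBITS, N_ACTIONS)

if __name__ == "__main__":
    print(small_classical)   # 387
    print(large_classical)   # 4611
    print(hybrid_classical)  # 101
```

Under these assumed dimensions the three results match the caption's $387$, $4611$, and $101$ parameters, so the hybrid model's total of $437 + 101 = 538$ trainable parameters is comparable in size to the small classical network.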

Theorems & Definitions (7)

  • Definition 4.1: Estimator empirical sample complexity
  • Definition 4.2: Significant Outperformance
  • Definition 2.1: Estimator empirical sample complexity -- general case
  • Theorem 2.2: Consistency
  • proof
  • Theorem 2.3: Bias
  • proof