A quantum-classical reinforcement learning model to play Atari games

Dominik Freinberger, Julian Lemmel, Radu Grosu, Sofiene Jerbi

TL;DR

This work investigates a hybrid quantum-classical reinforcement learning framework for high-dimensional observation spaces, evaluated on the Atari games Pong and Breakout. It combines classical feature extraction with a parametrized quantum circuit (PQC) that encodes the latent features and outputs expectation values of local Pauli-$Z$ measurements, which a linear post-processing layer maps to Q-values. The model is trained with approximate Q-learning using experience replay and a target network. The study shows that the hybrid model can learn Pong and approach the classical baseline in Breakout, and it analyzes how design choices, especially latent-space dimensionality and reward scaling, shape performance, offering guidance for fair benchmarking of quantum components. While no quantum advantage is demonstrated, the results advance understanding of quantum-classical interplay in RL and point to directions such as robustness to noise and task domains where quantum effects may be more beneficial.

Abstract

Recent advances in reinforcement learning have demonstrated the potential of quantum learning models based on parametrized quantum circuits (PQCs) as an alternative to deep learning models. On the one hand, these findings have shown the ultimate exponential speed-ups in learning that full-blown quantum models can offer in certain (artificially constructed) environments. On the other hand, they have demonstrated the ability of experimentally accessible PQCs to solve OpenAI Gym benchmarking tasks. However, it remains an open question whether these near-term quantum reinforcement learning (QRL) techniques can be successfully applied to more complex problems exhibiting high-dimensional observation spaces. In this work, we bridge this gap and present a hybrid model combining a PQC with classical feature encoding and post-processing layers that is capable of tackling Atari games. A classical model, subjected to architectural restrictions similar to those present in the hybrid model, is constructed to serve as a reference. Our numerical investigation demonstrates that the proposed hybrid model is capable of solving the Pong environment and achieving scores comparable to the classical reference in Breakout. Furthermore, our findings shed light on important hyperparameter settings and design choices that impact the interplay of the quantum and classical components. This work contributes to the understanding of near-term quantum learning models and takes an important step towards their deployment in real-world RL scenarios.

Paper Structure

This paper contains 23 sections, 11 equations, 14 figures, 1 table, 1 algorithm.

Figures (14)

  • Figure 1: The hybrid quantum-classical architecture. Three convolutional layers reduce the high-dimensional input and extract a small number of features, which are further combined and reduced by a linear pre-processing layer to create a highly informative latent representation. The PQC encodes these latent features and outputs the expectation values of local Pauli-$Z$ measurements. The output of the PQC is further post-processed by another fully connected layer with linear activation to match the Q-value magnitudes and action space dimension.
  • Figure 2: A parameterized quantum circuit as a machine learning model. A feature vector $\boldsymbol{x}$ is encoded into the quantum system, initialized in its trivial state $\ket{0}^{\otimes n}$, via the repeated encoding unitaries $U_l(\boldsymbol{x})$ (red). Intermediate variational unitaries $V_l(\boldsymbol{\theta})$ (blue) enable the training of the circuit. The output of the model $f_{\boldsymbol{\theta}}(\boldsymbol{x})$ is the expectation value $\braket{\boldsymbol{\mathcal{M}}}_{\boldsymbol{x}, \boldsymbol{\theta}}$ of one or more observables (e.g., a Pauli-$Z$ observable on each qubit) measured at the end of the circuit.
  • Figure 3: Rewards obtained during training for the hybrid and classical baseline models. Shaded areas indicate the standard deviation of multiple runs. Left: The hybrid agent (blue) and the classical reference (grey) show differences in learning performance in Pong. In this environment, the hybrid agent appears to learn faster and more consistently across multiple runs. Right: Hybrid agent (blue) and classical reference (grey) in Breakout. Here, the hybrid model achieves a strong performance but shows a 41% gap compared to the reference model.
  • Figure 4: Visualizing learned Q-values. We plot the output of the hybrid and classical baseline models as a function of two randomly chosen inputs to the pre-processing layer, in place of the output generated by the convolutional layers in Figure 1. Left: Close-up of the Q-value surface, where the range of input values reflects a typical range as observed during an episode of Breakout. Right: The Q-value surface over an expanded region of the input space.
  • Figure 5: Analysis of the impact of reward rescaling in Breakout. Left: The hybrid baseline model (blue), a hybrid model trained with 10x reward scaling and a final layer learning rate of 2.5e-2 (setting 1c), and a hybrid model trained with 100x reward scaling and a final layer learning rate of 2.5e-1 (setting 1f). Right: The classical reference model (blue), a classical reference trained with 10x reward scaling and a final layer learning rate of 2.5e-2 (setting 1a), and a classical reference trained with 100x reward scaling and a final layer learning rate of 2.5e-1 (setting 1b). While the hybrid model benefits from the modifications, the performance of the classical model deteriorates.
  • ...and 9 more figures
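
The model described in the captions of Figures 1 and 2 can be illustrated with a minimal statevector sketch in NumPy. This is not the authors' implementation. The gate choices (RX encoding rotations, RY/RZ variational rotations, a linear chain of CZ entanglers), the ordering of layers, and all function names are assumptions made for illustration. Only the overall structure follows the captions: repeated encoding unitaries interleaved with variational unitaries, a Pauli-$Z$ expectation value on each qubit, and a linear post-processing layer that maps these values to Q-values.

```python
import numpy as np

def rx(a):
    c, s = np.cos(a / 2), np.sin(a / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def ry(a):
    c, s = np.cos(a / 2), np.sin(a / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def rz(a):
    return np.diag([np.exp(-1j * a / 2), np.exp(1j * a / 2)])

def apply_single(state, gate, qubit, n):
    """Apply a 2x2 gate to one qubit of an n-qubit statevector."""
    psi = state.reshape([2] * n)
    psi = np.tensordot(gate, psi, axes=([1], [qubit]))
    return np.moveaxis(psi, 0, qubit).reshape(-1)

def apply_cz(state, q1, q2, n):
    """Apply a controlled-Z gate: flip the sign where both qubits are 1."""
    psi = state.reshape([2] * n).copy()
    idx = [slice(None)] * n
    idx[q1], idx[q2] = 1, 1
    psi[tuple(idx)] *= -1
    return psi.reshape(-1)

def variational_layer(state, params, n):
    """V_l(theta), assumed form: RY and RZ on each qubit, then a CZ chain."""
    for q in range(n):
        state = apply_single(state, ry(params[q, 0]), q, n)
        state = apply_single(state, rz(params[q, 1]), q, n)
    for q in range(n - 1):
        state = apply_cz(state, q, q + 1, n)
    return state

def pqc_expectations(x, theta):
    """Return <Z_q> for every qubit q.

    x:     latent features, one per qubit, shape (n,)
    theta: variational angles, shape (n_layers + 1, n, 2)
    """
    n = len(x)
    state = np.zeros(2 ** n, dtype=complex)
    state[0] = 1.0  # trivial state |0...0>
    for layer in range(theta.shape[0] - 1):
        state = variational_layer(state, theta[layer], n)
        for q in range(n):  # U_l(x), assumed form: RX(x_q) on qubit q
            state = apply_single(state, rx(x[q]), q, n)
    state = variational_layer(state, theta[-1], n)
    probs = (np.abs(state) ** 2).reshape([2] * n)
    z = []
    for q in range(n):
        other_axes = tuple(a for a in range(n) if a != q)
        p = probs.sum(axis=other_axes)  # marginal distribution of qubit q
        z.append(p[0] - p[1])           # <Z> = P(0) - P(1)
    return np.array(z)

def q_values(x, theta, weights, bias):
    """Linear post-processing: map PQC outputs in [-1, 1] to Q-values."""
    return weights @ pqc_expectations(x, theta) + bias

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_qubits, n_layers, n_actions = 3, 2, 4
    theta = rng.uniform(-np.pi, np.pi, size=(n_layers + 1, n_qubits, 2))
    weights = rng.normal(size=(n_actions, n_qubits))
    bias = np.zeros(n_actions)
    latent = rng.normal(size=n_qubits)
    print(q_values(latent, theta, weights, bias))
```

Because each Pauli-$Z$ expectation value lies in $[-1, 1]$, the trainable linear layer is what lets the output reach the magnitude of the Q-values and match the number of actions, as the Figure 1 caption states.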