Dissecting Quantum Reinforcement Learning: A Systematic Evaluation of Key Components

Javier Lazaro, Juan-Ignacio Vazquez, Pablo Garcia-Bringas

TL;DR

This work addresses the training instability and attribution challenges in PQC-based Quantum Reinforcement Learning (QRL) through a controlled, component-wise evaluation of three critical blocks: post-PQC inference, observation embedding via Data Reuploading (DR), and PQC ansatz design with entanglement. Using a unified PPO–CartPole framework and the SimplyQRL benchmarking suite, the authors show that Output Reuse (OR) yields gains only when paired with a meaningful quantum readout, that DR consistently enhances trainability with embedding-dependent scaling, and that entanglement effects are architecture-specific and sometimes degrade optimization. The study provides a reproducible protocol and open dataset for component-wise ablations, enabling causal attribution of quantum–classical synergy. Together, these findings offer practical guidance on composing quantum and classical components in hybrid QRL.

Abstract

Quantum Reinforcement Learning (QRL) based on parameterised quantum circuits (PQCs) has emerged as a promising paradigm at the intersection of quantum computing and reinforcement learning (RL). By design, PQCs create hybrid quantum-classical models, but their practical applicability remains uncertain due to training instabilities, barren plateaus (BPs), and the difficulty of isolating the contribution of individual pipeline components. In this work, we dissect PQC-based QRL architectures through a systematic experimental evaluation of three aspects recurrently identified as critical: (i) data embedding strategies, with Data Reuploading (DR) as an advanced approach; (ii) ansatz design, particularly the role of entanglement; and (iii) post-processing blocks after quantum measurement, with a focus on the underexplored Output Reuse (OR) technique. Using a unified PPO-CartPole framework, we perform controlled comparisons between hybrid and classical agents under identical conditions. Our results show that OR, though purely classical, exhibits distinct behaviour in hybrid pipelines, that DR improves trainability and stability, and that stronger entanglement can degrade optimisation, offsetting classical gains. Together, these findings provide controlled empirical evidence of the interplay between quantum and classical contributions, and establish a reproducible framework for systematic benchmarking and component-wise analysis in QRL.
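
To make the data re-uploading idea in point (i) concrete, here is a minimal single-qubit sketch in plain NumPy. It is an illustration under assumptions, not the authors' circuit or the SimplyQRL API: the function names, the choice of three features per qubit, the trainable rotation placed after each embedding, and the Pauli-Z readout are all hypothetical.

```python
import numpy as np


def rz(angle):
    """Single-qubit rotation about the Z axis."""
    return np.array([[np.exp(-0.5j * angle), 0.0],
                     [0.0, np.exp(0.5j * angle)]])


def ry(angle):
    """Single-qubit rotation about the Y axis."""
    c, s = np.cos(angle / 2.0), np.sin(angle / 2.0)
    return np.array([[c, -s], [s, c]], dtype=complex)


def reuploading_expectation(x, thetas):
    """Return <Z> after re-uploading the features x once per layer.

    x      : three features (hypothetical choice for this sketch)
    thetas : array of shape (L, 3), one trainable rotation per layer
    """
    state = np.array([1.0, 0.0], dtype=complex)  # start in |0>
    for theta in np.asarray(thetas, dtype=float):
        # Embedding block: the same data is encoded again in every layer.
        state = rz(x[2]) @ ry(x[1]) @ rz(x[0]) @ state
        # Trainable block (assumed layout, mirrors the embedding pattern).
        state = rz(theta[2]) @ ry(theta[1]) @ rz(theta[0]) @ state
    probs = np.abs(state) ** 2
    return float(probs[0] - probs[1])


# Example: two re-uploading layers with randomly initialised parameters.
rng = np.random.default_rng(0)
value = reuploading_expectation([0.1, -0.4, 0.7], rng.normal(size=(2, 3)))
```

The number of rows in `thetas` plays the role of the circuit depth: each extra row adds one more embedding-plus-trainable layer acting on the same input.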

Paper Structure

This paper contains 27 sections, 3 equations, 8 figures.

Figures (8)

  • Figure 1: Schematic of a standard hybrid QRL pipeline integrating a parameterised quantum circuit (PQC) into a classical RL loop. The diagram highlights the three principal components analysed in this work: data embedding $U(\bar{x})$, variational ansatz $W(\Theta)$, and post-measurement inference, through which observations are encoded, processed quantum-mechanically, and interpreted classically to produce actions.
  • Figure 2: Illustration of the Output Reuse (OR) technique [hsiao_unentangled_2022]. The PQC output vector (A–D) is replicated $N$ times before the classical interpretation layer, increasing the input dimensionality of the neural head.
  • Figure 3: Comparison of the two angle-embedding philosophies evaluated. (a) Skolik-style embedding encodes one feature per qubit through a single $R_X(x)$ rotation. (b) UQC-style embedding follows an $R_Z R_Y R_Z$ pattern that encodes multiple data components per qubit and naturally extends to DR [perez-salinas_data_2020]. The dotted area marks one DR layer, whose repetition defines circuit depth $L$. This distinction underlies our analysis of embedding expressibility and scalability.
  • Figure 4: Circuit templates used in the entanglement ablation. (a) Template A (Skolik-derived) uses a ring of CZ gates to introduce entanglement between adjacent qubits. (b) Template B (Hsiao-derived) is initially unentangled and later extended with a ring of CNOTs. The dotted region denotes the entanglement layer toggled on/off in the experiments. This design isolates the influence of entanglement under otherwise identical conditions.
  • Figure 5: Learning curves for hybrid and classical agents under increasing Output Reuse ($R$). While OR consistently improves hybrid performance, classical agents show only minor or unstable gains, confirming that OR’s efficacy depends on the presence of a quantum block.
  • ...and 3 more figures
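
The Output Reuse step described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a hedged illustration only: the helper names `output_reuse` and `linear_head`, the block-wise copy order (A B C D A B C D ...), and the single linear layer standing in for the neural head are assumptions, not details taken from the paper.

```python
import numpy as np


def output_reuse(pqc_output, n_copies):
    """Replicate the PQC output vector n_copies times (block-wise order assumed)."""
    if n_copies < 1:
        raise ValueError("n_copies must be at least 1")
    return np.tile(np.asarray(pqc_output, dtype=float), n_copies)


def linear_head(features, weights, bias):
    """Hypothetical classical head: one linear layer producing action logits."""
    return weights @ features + bias


# Four measured expectation values (A-D in Figure 2), reused three times.
pqc_output = [0.12, -0.40, 0.77, 0.05]
reused = output_reuse(pqc_output, n_copies=3)

# The head now receives 12 inputs instead of 4; CartPole has 2 actions.
rng = np.random.default_rng(0)
weights = rng.normal(size=(2, reused.size))
bias = np.zeros(2)
logits = linear_head(reused, weights, bias)
```

Replication adds no new information to the readout; it only widens the input of the classical head, which is why the step is purely classical.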