Dissecting Quantum Reinforcement Learning: A Systematic Evaluation of Key Components
Javier Lazaro, Juan-Ignacio Vazquez, Pablo Garcia-Bringas
TL;DR
This work addresses training instability and attribution challenges in PQC-based Quantum Reinforcement Learning (QRL) through a controlled, component-wise evaluation of three critical blocks: post-PQC inference, observation embedding via Data Reuploading (DR), and PQC ansatz design with entanglement. Using a unified PPO–CartPole framework and the SimplyQRL benchmarking suite, the authors show that Output Reuse (OR) yields gains only when paired with a meaningful quantum readout, that DR consistently enhances trainability with embedding-dependent scaling, and that entanglement effects are architecture-specific, sometimes degrading optimization. The study provides a reproducible protocol and an open dataset for component-wise ablations, enabling causal attribution of quantum–classical synergy. Together, these findings offer practical guidance on composing quantum and classical components in hybrid QRL systems.
Abstract
Quantum Reinforcement Learning (QRL) based on parameterised quantum circuits (PQCs) has emerged as a promising paradigm at the intersection of quantum computing and reinforcement learning (RL). By design, PQCs create hybrid quantum-classical models, but their practical applicability remains uncertain due to training instabilities, barren plateaus (BPs), and the difficulty of isolating the contribution of individual pipeline components. In this work, we dissect PQC-based QRL architectures through a systematic experimental evaluation of three aspects recurrently identified as critical: (i) data embedding strategies, with Data Reuploading (DR) as an advanced approach; (ii) ansatz design, particularly the role of entanglement; and (iii) post-processing blocks after quantum measurement, with a focus on the underexplored Output Reuse (OR) technique. Using a unified PPO-CartPole framework, we perform controlled comparisons between hybrid and classical agents under identical conditions. Our results show that OR, though purely classical, exhibits distinct behaviour in hybrid pipelines, that DR improves trainability and stability, and that stronger entanglement can degrade optimisation, offsetting classical gains. Together, these findings provide controlled empirical evidence of the interplay between quantum and classical contributions, and establish a reproducible framework for systematic benchmarking and component-wise analysis in QRL.
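
As an illustration of the Data Reuploading idea mentioned above, the following is a minimal sketch of a single-qubit circuit simulated with NumPy, in which the same input x is re-encoded before every trainable layer. It is not the authors' SimplyQRL implementation: the function names (rx, ry, reuploading_expectation), the choice of an RX encoding followed by a trainable RY rotation, and the single-qubit Z readout are all assumptions made for this example.

```python
import numpy as np


def rx(angle):
    """Single-qubit rotation about the X axis."""
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])


def ry(angle):
    """Single-qubit rotation about the Y axis."""
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)


def reuploading_expectation(x, thetas):
    """Return <Z> after len(thetas) layers.

    Each layer re-encodes the same input x with RX (the "reupload")
    and then applies one trainable RY rotation. This layer structure
    is a hypothetical choice for illustration only.
    """
    state = np.array([1.0, 0.0], dtype=complex)  # start in |0>
    for theta in thetas:
        state = rx(x) @ state      # data encoding, repeated in every layer
        state = ry(theta) @ state  # trainable rotation
    probs = np.abs(state) ** 2
    return float(probs[0] - probs[1])  # <Z> = P(0) - P(1)


if __name__ == "__main__":
    x = 0.7
    print(reuploading_expectation(x, [0.0]))       # one encoding layer
    print(reuploading_expectation(x, [0.0, 0.0]))  # two encoding layers
```

With all trainable angles set to zero, one layer gives an expectation of cos(x) and two layers give cos(2x), so each additional upload raises the highest input frequency the circuit can express. This added expressivity is one commonly cited motivation for reuploading. The trainability effects reported in the abstract are empirical findings that this toy example does not reproduce.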
