Trainability issues in quantum policy gradients
André Sequeira, Luis Paulo Santos, Luis Soares Barbosa
TL;DR
This work studies the trainability of policies based on parameterized quantum circuits (PQCs) in reinforcement learning (RL) by analyzing cost-function-dependent barren plateaus and gradient explosions. It develops a formal framework for quantum policy gradients, contrasting Contiguous-like and Parity-like Born policies with Softmax policies, and derives results based on the gradient variance and the Fisher Information Matrix (FIM) that connect action-space size, qubit count $n$, and measurement locality to trainability. The key contributions are lower bounds on the gradient variance for Contiguous-like policies, evidence of barren plateaus for Parity-like policies when the number of actions is polynomial in $n$, and a transition to exploding gradients when the number of actions grows faster than polynomially in $n$. These results are corroborated by numerical experiments on simplified 2-design PQCs and multi-armed bandits. The findings expose practical constraints of PQC-based RL and guide design choices for action-space scaling, measurement locality, and ansatz selection, with implications for achieving quantum advantage in policy optimization. Overall, the work clarifies when PQC-based policy gradients can be trained efficiently and where fundamental limitations arise, informing future efforts to mitigate barren plateaus and gradient explosions in quantum RL.
Abstract
This research explores the trainability of Parameterized Quantum Circuit (PQC)-based policies in Reinforcement Learning, an area that has recently seen a surge in empirical exploration. While some studies suggest improved sample complexity using quantum gradient estimation, the efficient trainability of these policies remains an open question. Our findings reveal significant challenges, including standard Barren Plateaus with exponentially small gradients and gradient explosion. These phenomena depend on the type of basis-state partitioning and on how these partitions are mapped onto actions. For a polynomial number of actions, a trainable window can be ensured with a polynomial number of measurements if a contiguous-like partitioning of basis states is employed. These results are empirically validated in a multi-armed bandit environment.
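To make the notion of basis-state partitioning concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a Born policy assigns each action the total measurement probability of the basis states in its partition. The function names, the equal-sized contiguous blocks, and the two-action parity rule are illustrative assumptions, and the probability vector is a hypothetical 2-qubit example.

```python
import numpy as np


def contiguous_policy(probs, num_actions):
    """Assumed contiguous-like rule: split the basis states (in index order)
    into equal-sized consecutive blocks and sum the probabilities per block."""
    probs = np.asarray(probs, dtype=float)
    if probs.size % num_actions != 0:
        raise ValueError("number of basis states must be divisible by num_actions")
    return probs.reshape(num_actions, -1).sum(axis=1)


def parity_policy(probs):
    """Assumed parity-like rule for two actions: the action is the parity
    of the number of 1s in the measured bitstring."""
    probs = np.asarray(probs, dtype=float)
    parity = np.array([bin(i).count("1") % 2 for i in range(probs.size)])
    return np.array([probs[parity == 0].sum(), probs[parity == 1].sum()])


# Hypothetical 2-qubit measurement distribution over |00>, |01>, |10>, |11>.
probs = np.array([0.4, 0.3, 0.2, 0.1])

pi_contiguous = contiguous_policy(probs, num_actions=2)  # blocks {00, 01}, {10, 11}
pi_parity = parity_policy(probs)                         # sets {00, 11}, {01, 10}
```

With two actions, the contiguous partition in this sketch depends only on the most significant bit of the measured bitstring, whereas the parity partition depends on every bit. This is one way to see how the choice of partitioning relates to measurement locality.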
