Trainability issues in quantum policy gradients

André Sequeira; Luis Paulo Santos; Luis Soares Barbosa

TL;DR

This work tackles the trainability of PQC-based policies in reinforcement learning by analyzing cost-function-dependent barren plateaus and gradient explosions. It develops a formal framework for quantum policy gradients, contrasting Contiguous-like and Parity-like Born policies with Softmax policies, and derives variance-based and Fisher Information Matrix (FIM)-based insights that connect action-space size, qubit count, and measurement locality to trainability. The key contributions include lower bounds on the gradient variance for Contiguous-like policies, evidence of barren plateaus for Parity-like policies under polynomially sized action spaces, and a transition to exploding gradients when the number of actions grows faster than polynomially in the number of qubits $n$; these are corroborated by numerical experiments on simplified 2-design PQCs and multi-armed bandits. The findings illuminate the practical constraints of PQC-based RL and guide design choices for action-space scaling, measurement locality, and ansatz selection, with implications for achieving quantum advantage in policy optimization. Overall, the work clarifies when PQC-based policy gradients can be trained efficiently and where fundamental limitations arise, informing future efforts to mitigate barren plateaus and gradient explosions in quantum RL.

Abstract

This research explores the trainability of Parameterized Quantum Circuit (PQC)-based policies in Reinforcement Learning, an area that has recently seen a surge in empirical exploration. While some studies suggest improved sample complexity using quantum gradient estimation, the efficient trainability of these policies remains an open question. Our findings reveal significant challenges, including standard Barren Plateaus with exponentially small gradients and gradient explosion. These phenomena depend on the type of basis-state partitioning and on how these partitions are mapped onto actions. For a polynomial number of actions, a trainable window can be ensured with a polynomial number of measurements if a contiguous-like partitioning of basis states is employed. These results are empirically validated in a multi-armed bandit environment.
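To make the notion of basis-state partitioning concrete, the following is a minimal Python sketch of how a Born policy could turn measurement probabilities over the $2^n$ basis states into action probabilities. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the contiguous-like partition is assumed to split the basis-state indices into equal consecutive blocks, and the parity-like partition is shown only for $|A|=2$, assumed to assign each basis state to the parity of its bitstring.

```python
import numpy as np


def contiguous_partition(n_qubits, n_actions):
    """Assumed contiguous-like map: split indices 0..2^n-1 into equal consecutive blocks."""
    dim = 2 ** n_qubits
    if dim % n_actions != 0:
        raise ValueError("this sketch assumes |A| divides 2^n")
    block = dim // n_actions
    return np.arange(dim) // block


def parity_partition(n_qubits):
    """Assumed parity-like map for |A| = 2: action = parity of the bitstring."""
    idx = np.arange(2 ** n_qubits)
    parity = np.zeros_like(idx)
    for q in range(n_qubits):
        parity ^= (idx >> q) & 1  # accumulate XOR of every bit
    return parity


def born_policy(basis_probs, partition, n_actions):
    """Sum basis-state probabilities within each partition to get pi(a)."""
    return np.bincount(partition, weights=basis_probs, minlength=n_actions)


# Toy distribution over the four basis states of a 2-qubit circuit (hypothetical values).
probs = np.array([0.1, 0.2, 0.3, 0.4])
pi_contiguous = born_policy(probs, contiguous_partition(2, 2), 2)
pi_parity = born_policy(probs, parity_partition(2), 2)
```

In this toy case the contiguous-like policy depends only on the most significant qubit (states 00 and 01 versus 10 and 11), whereas the parity-like policy depends on all qubits jointly, which is the locality difference the abstract ties to trainability.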

Paper Structure (17 sections, 4 theorems, 30 equations, 12 figures, 1 table, 1 algorithm)

Introduction
Quantum Policy Gradients
Born policy
Softmax policy
Gradient estimation
Trainability issues in Born policies
The instructive case of product states
Generalized behavior for entangled states
Variance as a function of $|A|$
Analysis of the Fisher Information spectrum
Summary
Numerical experiments
Trainability issues using a simplified 2-design
Multi armed bandits
Conclusion
...and 2 more sections

Key Result

Lemma 3.1

Let $\pi(a|s,\theta)$ be an $n$-qubit PQC-based policy with $\theta \in \mathbb{R}^k$. Let $T$ be the trajectory horizon, $R_{\text{max}}$ the maximum reward, and $\gamma$ the discount factor. Then the policy gradient variance w.r.t. the variational parameters $\theta$ is upper bounded by ...

Figures (12)

Figure 1: Partitions considered for the base case of $|A|=2$. (a) and (b) illustrate a contiguous-like and a parity-like partitioning, respectively, of all $2^n$ basis states. (c) Action-projector-like partitioning considering just two basis states.
Figure 2: Variance and expectation value of the gradient of log-probability cost-function. (a) Variance for the all-zero state with parameterized state of individual qubit y-rotations. (b) Random product state composed of Pauli rotations sampled uniformly at random. (c) Expectation value for randomly sampled projectors in the random product state in (b).
Figure 3: Variance of the log policy gradient for three distinct entangled states. (a) Simplified two-design. (b) Strongly entangling layers. (c) Random states composed of Pauli rotations sampled uniformly at random followed by randomly selected CZ gates. (d) Variance as a function of the number of qubits for the circuits (a)-(c).
Figure 4: Simplified two-design ansatz with the first set of rotations in purple representing the unitary $E(s)$ responsible for the encoding of the state of the agent, $s$.
Figure 5: Variance of the log policy gradient for contiguous-like Born policies: (a) and (b) as a function of $|A|$ and (c) semi-logarithmic plot for varying number of qubits.
...and 7 more figures

Theorems & Definitions (9)

Definition 2.1
Definition 2.2
Lemma 3.1
proof
Lemma 3.2
proof
Lemma 3.3
Lemma B.1
proof
