On Quantum Natural Policy Gradients

André Sequeira; Luis Paulo Santos; Luis Soares Barbosa

TL;DR

The results indicate that a PQC-based agent using the quantum FIM without additional insights typically incurs a larger approximation error and does not guarantee improved performance compared to the classical FIM.

Abstract

This work examines the role of the quantum Fisher Information Matrix (FIM) in enhancing the performance of Parameterized Quantum Circuit (PQC)-based reinforcement learning agents. While previous studies have highlighted the effectiveness of PQC-based policies preconditioned with the quantum FIM in contextual bandits, its impact in broader reinforcement learning settings, such as Markov Decision Processes, is less clear. Through a detailed analysis of Löwner inequalities between the quantum and classical FIMs, this study identifies the distinctions between the two and the implications of using each. Our results indicate that a PQC-based agent using the quantum FIM without additional insights typically incurs a larger approximation error and is not guaranteed to outperform an agent using the classical FIM. Empirical evaluations on classic control benchmarks suggest that, although quantum FIM preconditioning outperforms standard gradient ascent, it is not in general superior to classical FIM preconditioning.
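To make "FIM preconditioning" concrete, the following is a minimal sketch of one natural policy gradient step that uses the classical FIM only. It assumes a single-state softmax policy over three actions with known rewards, which is a toy setting chosen for illustration and not the PQC-based agents or environments evaluated in the paper. All function names, the regularization constant `lam`, and the reward values are hypothetical.

```python
import math

def softmax(theta):
    # Numerically stable softmax over a list of logits.
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_reward(p, rewards):
    return sum(pi * ri for pi, ri in zip(p, rewards))

def policy_gradient(p, rewards):
    # Gradient of J = sum_a p_a r_a w.r.t. the logits: p_a * (r_a - J).
    j = expected_reward(p, rewards)
    return [pi * (ri - j) for pi, ri in zip(p, rewards)]

def classical_fim(p):
    # Classical FIM of a softmax policy: E_a[grad log pi grad log pi^T]
    # = diag(p) - p p^T.
    n = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(n)]
            for i in range(n)]

def solve_linear(a, b):
    # Gaussian elimination with partial pivoting (pure Python, small systems).
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def natural_gradient(p, rewards, lam=1e-6):
    # Solve (F + lam * I) d = g. The softmax FIM is singular, so a small
    # ridge term lam (an assumption of this sketch) keeps the system solvable.
    g = policy_gradient(p, rewards)
    f = classical_fim(p)
    for i in range(len(p)):
        f[i][i] += lam
    return solve_linear(f, g)

theta = [0.0, 0.0, 0.0]        # initial logits (uniform policy)
rewards = [1.0, 0.0, 0.5]      # hypothetical per-action rewards
eta = 1.0                      # step size, chosen arbitrarily
direction = natural_gradient(softmax(theta), rewards)
theta = [t + eta * d for t, d in zip(theta, direction)]
```

A quantum natural gradient step would replace `classical_fim` with an estimate of the quantum FIM of the circuit state; that estimation is not sketched here.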

Paper Structure (9 sections, 1 theorem, 41 equations, 4 figures, 3 tables)

Introduction
Quantum Policy Gradients
Natural gradients in policy optimization
Performance Evaluation in Benchmarking Environments
Comparative analysis for the estimation of information matrices
Sample complexity of estimating classical FIM
Sample complexity of estimating quantum FIM
Conclusion
Tables for environments description and PQC's

Key Result

Lemma 3.1

Fix a comparison policy $\tilde{\pi}$ and a state distribution $\rho$. Assume that, for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$, $\log \pi(a \mid s,\theta)$ is a $\beta$-smooth function of $\theta$. Let $\pi^{(0)}$ be the uniform distribution for every state, and consider the sequence of weights $w^{(\cdot)}$ […] Then the regret at time step $t$ is upper bounded by:

Figures (4)

Figure 1: The parameterized quantum circuit used in the numerical experiments. Data reuploading follows jerbi_parametrized_2021, but input scaling was excluded to improve the estimation of the quantum FIMs.
Figure 2: Born policies for Cartpole and Acrobot environments.
Figure 3: Performance of the NPG algorithm (and its generalized quantum counterpart) in the Cartpole environment. Subfigures (a) and (b) represent the performance of Born and Softmax policies using the cumulative reward as the evaluation metric.
Figure 4: Performance of the NPG algorithm (and its generalized quantum counterpart) in the Acrobot environment. Subfigures (a) and (b) showcase the performance of Born and Softmax policies using cumulative reward as a performance measure.

Theorems & Definitions (3)

Definition 2.1
Definition 2.2
Lemma 3.1: NPG Regret Lemma (agarwal_theory_2021)
