Provably Efficient Exploration in Quantum Reinforcement Learning with Logarithmic Worst-Case Regret

Han Zhong; Jiachen Hu; Yecheng Xue; Tongyang Li; Liwei Wang

TL;DR

This work tackles online exploration in reinforcement learning when the learner can access the environment through quantum oracles. It introduces two algorithms, Quantum UCRL for tabular MDPs and Quantum UCRL-VTR for linear mixture MDPs, whose regret depends only polylogarithmically on the number of episodes $T$: $\mathcal{O}(\mathrm{poly}(S,A,H,\log T))$ and $\mathcal{O}(\mathrm{poly}(d,H,\log T))$ respectively. Central to the approach are quantum mean estimation and amplitude estimation subroutines, a doubling-based lazy updating scheme to reuse quantum samples, and novel feature and regression-target estimation techniques. Together these yield polylogarithmic dependence on $T$, in contrast to the $\Omega(\sqrt{T})$ lower bound for classical RL. The results establish the first provable logarithmic worst-case regret for online quantum RL and point to a broader framework for extending quantum speedups to general function approximation in RL.
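To make the doubling-based lazy updating idea concrete, here is a minimal sketch of a doubling schedule. It assumes an update is triggered whenever the visit count of a state-action pair reaches twice its value at the previous update. The function name `count_lazy_updates` and this exact trigger rule are illustrative assumptions; the paper's own condition and data collection scheme may differ.

```python
from collections import defaultdict


def count_lazy_updates(visits):
    """Count how often a doubling rule triggers a re-estimation.

    `visits` is a sequence of hashable (state, action) keys in visiting order.
    Hypothetical rule: a pair triggers an update when its visit count reaches
    the current threshold, and the threshold then doubles (1, 2, 4, 8, ...).
    """
    counts = defaultdict(int)
    threshold = defaultdict(lambda: 1)
    updates = 0
    for key in visits:
        counts[key] += 1
        if counts[key] >= threshold[key]:
            updates += 1
            threshold[key] = 2 * counts[key]
    return updates


# A single pair visited 1000 times triggers updates at counts 1, 2, 4, ..., 512.
single_pair = [("s0", "a0")] * 1000
num_updates = count_lazy_updates(single_pair)
```

Under this rule the number of updates for a pair visited $n$ times is $\lfloor \log_2 n \rfloor + 1$, so the total number of re-estimations grows logarithmically rather than linearly in the number of visits.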

Abstract

While quantum reinforcement learning (RL) has attracted a surge of attention recently, its theoretical understanding is limited. In particular, it remains elusive how to design provably efficient quantum RL algorithms that can address the exploration-exploitation trade-off. To this end, we propose a novel UCRL-style algorithm that takes advantage of quantum computing for tabular Markov decision processes (MDPs) with $S$ states, $A$ actions, and horizon $H$, and establish an $\mathcal{O}(\mathrm{poly}(S, A, H, \log T))$ worst-case regret for it, where $T$ is the number of episodes. Furthermore, we extend our results to quantum RL with linear function approximation, which is capable of handling problems with large state spaces. Specifically, we develop a quantum algorithm based on value target regression (VTR) for linear mixture MDPs with $d$-dimensional linear representation and prove that it enjoys $\mathcal{O}(\mathrm{poly}(d, H, \log T))$ regret. Our algorithms are variants of UCRL/UCRL-VTR algorithms in classical RL, which also leverage a novel combination of lazy updating mechanisms and quantum estimation subroutines. This is the key to breaking the $\Omega(\sqrt{T})$-regret barrier in classical RL. To the best of our knowledge, this is the first work studying online exploration in quantum RL with provable logarithmic worst-case regret.
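As a rough heuristic for why faster estimation can turn $\sqrt{T}$ into $\log T$, one can compare the accumulated estimation error under a classical rate and under the quadratically faster rate of quantum mean estimation. The constant $c$ and the error sequence $\epsilon_n$ are illustrative symbols introduced here; this sketch ignores the dependence on $S$, $A$, $H$, and $d$ as well as logarithmic factors, and it is not the paper's proof.

```latex
\begin{align*}
\text{classical: } & \epsilon_n \approx \frac{c}{\sqrt{n}}
  \;\Longrightarrow\; \sum_{n=1}^{T} \epsilon_n \le 2c\sqrt{T}, \\
\text{quantum: }   & \epsilon_n \approx \frac{c}{n}
  \;\Longrightarrow\; \sum_{n=1}^{T} \epsilon_n \le c\,(1 + \ln T).
\end{align*}
```

If per-episode regret is controlled by the current estimation error, the first sum matches the classical $\sqrt{T}$ scaling and the second the logarithmic scaling claimed above.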

Paper Structure (47 sections, 14 theorems, 112 equations, 6 algorithms)

Introduction
Contributions.
Challenges and Technical Overview.
Preliminaries
Linear Function Approximation.
Quantum Reinforcement Learning
Quantum Computing
Information transfer between quantum and classical computers.
Quantum multi-dimensional amplitude estimation and multivariate mean estimation.
Quantum-Accessible Environments
Quantum Exploration Problem
Warmup: Results for Tabular MDPs
Data collection scheme.
Lazy updating via doubling trick.
Optimistic planning.
...and 32 more sections

Key Result

Lemma 3.1

Assume that we have access to the probability oracle $U_p\colon \vert 0 \rangle \rightarrow \sum_{i = 0}^{n-1} \sqrt{p_i}\,\vert i \rangle\vert \phi_i \rangle$ for an $n$-dimensional probability distribution $p$ and ancilla quantum states. (Footnote: Ancilla quantum states help and broaden the scope of quantum comp…) …

Theorems & Definitions (22)

Definition 2.1: Linear Mixture MDP
Lemma 3.1: Quantum multi-dimensional amplitude estimation (rephrased from Theorem 5 of van2021quantum)
Lemma 3.2: Quantum multivariate mean estimation (rephrased from Theorem 3.3 of cornelissen2022near)
Theorem 4.1
Remark 5.1
Theorem 5.2
Definition B.1: Probability Oracle
Definition B.2: Binary Oracle
Remark B.3
Remark B.4: Discussion on Counter Updating
...and 12 more
