Quantum reinforcement learning in continuous action space

Shaojun Wu, Shan Jin, Dingding Wen, Donghong Han, Xiaoting Wang

TL;DR

This work addresses quantum reinforcement learning in continuous action spaces by introducing a quantum Deep Deterministic Policy Gradient (DDPG) framework that uses variational quantum neural networks to model both the policy and the value function. The approach enables single-shot quantum state generation: a one-time optimization yields a model that outputs a sequence of parametric unitaries $U_a(\bm\theta_t)$ capable of driving any $|s_0\rangle$ to a target $|s_d\rangle$, with the inverse sequence allowing recovery of $|s_0\rangle$ from $|s_d\rangle$. The authors demonstrate applications to quantum state generation and eigenvalue problems by embedding the environment in quantum registers and using quantum phase estimation as part of the reward structure, achieving high overlap values (e.g., $p_{t+1}$ near 1) in simulations of one- and two-qubit systems. A complexity analysis shows that the method requires $K=\mathcal{O}(1/\epsilon^2)$ measurements and has gate complexities that scale with problem size similarly to other quantum-classical hybrid methods such as VQE, highlighting its potential for near-term quantum devices and broader quantum-control tasks.

Abstract

Quantum reinforcement learning (QRL) is a promising paradigm for near-term quantum devices. While existing QRL methods have shown success in discrete action spaces, extending these techniques to continuous domains is challenging due to the curse of dimensionality introduced by discretization. To overcome this limitation, we introduce a quantum Deep Deterministic Policy Gradient (DDPG) algorithm that efficiently addresses both classical and quantum sequential decision problems in continuous action spaces. Moreover, our approach facilitates single-shot quantum state generation: a one-time optimization produces a model that outputs the control sequence required to drive a fixed initial state to any desired target state. In contrast, conventional quantum control methods demand separate optimization for each target state. We demonstrate the effectiveness of our method through simulations and discuss its potential applications in quantum control.

Paper Structure

This paper contains 12 sections, 2 theorems, 8 equations, 9 figures, 1 algorithm.

Key Result

Lemma 1

Let $X$ be a random variable with finite expected value $\mu$ and finite non-zero variance $\sigma^2$. Then for any real number $k>0$, $P(|X-\mu|\ge k\sigma) \leq \frac{1}{k^2}$.
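
To make the role of Lemma 1 concrete, the sketch below shows one way a bound of the form $K=\mathcal{O}(1/\epsilon^2)$ can follow from Chebyshev's inequality. It assumes, as an illustration rather than the paper's own derivation, that an overlap $p$ is estimated as the mean of $K$ independent 0/1 measurement outcomes, whose variance is at most $1/4$. The names `shots_needed` and `failure_rate` and the failure probability `delta` are hypothetical and do not come from the source.

```python
import math
import random


def shots_needed(eps, delta, variance_bound=0.25):
    """Smallest K such that Chebyshev gives P(|mean - p| >= eps) <= delta.

    The mean of K i.i.d. outcomes has variance sigma^2 / K, so Chebyshev
    yields P(|mean - p| >= eps) <= sigma^2 / (K * eps^2).
    variance_bound=0.25 is the worst case for a 0/1 outcome (assumption:
    each measurement is a single Bernoulli trial).
    """
    return math.ceil(variance_bound / (delta * eps ** 2))


def failure_rate(p, eps, shots, trials, seed=0):
    """Empirical fraction of runs whose estimate misses p by at least eps."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        hits = sum(rng.random() < p for _ in range(shots))
        if abs(hits / shots - p) >= eps:
            failures += 1
    return failures / trials


if __name__ == "__main__":
    k = shots_needed(eps=0.1, delta=0.1)
    print("shots:", k)
    print("empirical failure rate:", failure_rate(0.3, 0.1, k, trials=200))
```

Halving `eps` quadruples the returned shot count, which is the $1/\epsilon^2$ scaling. Chebyshev's bound is loose, so the empirical failure rate is typically far below `delta`.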

Figures (9)

  • Figure 1: The QRL model. Each iterative step can be described by the following loop: (1) at step $t$, the agent receives $| s_t \rangle$ and generates the action parameter $\bm\theta_t$ according to the current policy; (2) the agent generates $| s_{t+1} \rangle\equiv U_a(\bm\theta_t)| s_t \rangle$; (3) based on $| s_t \rangle$ and $| s_{t+1} \rangle$, a reward $r_{t+1}$ is calculated and fed back to the agent, together with $| s_{t+1} \rangle$; (4) based on $| s_{t+1} \rangle$ and $r_{t+1}$, the policy is updated and then used to generate $\bm\theta_{t+1}$.
  • Figure 2: The quantum circuit for our QRL framework at each iteration. The entire QRL process consists of two stages, so the circuit for each stage is given separately. In stage 1, the circuit includes two registers: the reward register, initialized to $| 0 \rangle$, and the environment register, in state $| s_t \rangle$. $U_{\operatorname{policy}}$ is generated by the quantum neural network and determines the action unitary $U_a (\bm\theta_t)$. $U_r$ and $M$ are designed to generate the reward $r_{t+1}$. In stage 2, the circuit has only the environment register, and there is no need to feed back the reward value or update the policy.
  • Figure 3: Circuit architecture for the VQC. $R_{\beta}(\alpha)\equiv \exp(-i\sigma_{\beta}\alpha/2)$ with $\beta=x,z$. $U_{\textup{ENT}}\equiv \prod_{k=1}^{n-1}\operatorname{CNOT}_{(k,k+1)}$, where $\operatorname{CNOT}_{(k,k+1)}$ denotes the CNOT gate using the $k$-th qubit to control the $(k+1)$-th qubit. $C_j$ is the outcome of the measurement on the observable $B_j$, $j=1,2,\cdots$.
  • Figure 4: Simulation results of quantum DDPG for the quantum state generation problem with the one-qubit and the two-qubit Hamiltonians. For $1000$ different initial states $| s_0 \rangle$, we plot how the average $\bar{p}_t$ and the variance $\Delta(p_{t})$ change with the iteration step $t$. For the one-qubit case, at $t=50$, $\bar{p}_{50} \ge 0.99$ and $\Delta(p_{50})\leq 4.47\times 10^{-5}$. For the two-qubit case, at $t=50$, $\bar{p}_{50} \ge 0.98$ and $\Delta(p_{50})\leq 4.04\times 10^{-7}$.
  • Figure 5: The quantum phase estimation circuit $U_{\operatorname{PE}}$.
  • ...and 4 more figures
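
The state update in Figure 1, $| s_{t+1} \rangle\equiv U_a(\bm\theta_t)| s_t \rangle$, and the claim that the inverse sequence recovers $| s_0 \rangle$ can be sketched for a single qubit in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the decomposition of $U_a$ into $R_z R_x R_z$ is assumed (only the rotations $R_\beta(\alpha)$ come from Figure 3), the angles are random stand-ins for the output of a trained policy, and all function names are hypothetical.

```python
import cmath
import math
import random

# 2x2 complex matrices are nested tuples; states are length-2 tuples.


def rx(alpha):
    """R_x(alpha) = exp(-i * sigma_x * alpha / 2), as defined in Figure 3."""
    c, s = math.cos(alpha / 2), math.sin(alpha / 2)
    return ((c + 0j, -1j * s), (-1j * s, c + 0j))


def rz(alpha):
    """R_z(alpha) = exp(-i * sigma_z * alpha / 2), as defined in Figure 3."""
    return ((cmath.exp(-1j * alpha / 2), 0j), (0j, cmath.exp(1j * alpha / 2)))


def matmul(a, b):
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def dagger(u):
    """Conjugate transpose, which is the inverse of a unitary."""
    return tuple(tuple(u[j][i].conjugate() for j in range(2)) for i in range(2))


def apply(u, state):
    return tuple(sum(u[i][k] * state[k] for k in range(2)) for i in range(2))


def overlap(a, b):
    """Squared overlap |<a|b>|^2 between two states."""
    return abs(sum(x.conjugate() * y for x, y in zip(a, b))) ** 2


def action_unitary(theta):
    # Assumed parameterization: U_a(theta) = R_z(theta[2]) R_x(theta[1]) R_z(theta[0]).
    return matmul(rz(theta[2]), matmul(rx(theta[1]), rz(theta[0])))


def rollout(s0, thetas):
    """Apply U_a(theta_t) step by step and return s_0, s_1, ..., s_T."""
    states = [s0]
    for theta in thetas:
        states.append(apply(action_unitary(theta), states[-1]))
    return states


def rollback(s_final, thetas):
    """Apply the inverse unitaries in reverse order."""
    state = s_final
    for theta in reversed(thetas):
        state = apply(dagger(action_unitary(theta)), state)
    return state


rng = random.Random(0)
# Random angles stand in for the policy output theta_t (assumption).
thetas = [[rng.uniform(-math.pi, math.pi) for _ in range(3)] for _ in range(5)]
s0 = (1 + 0j, 0j)  # |0>
states = rollout(s0, thetas)
recovered = rollback(states[-1], thetas)

if __name__ == "__main__":
    print("overlap of recovered state with s0:", overlap(s0, recovered))
```

Because every $U_a(\bm\theta_t)$ is unitary, `rollback` undoes `rollout` up to floating-point error, so the recovered state has overlap close to 1 with `s0`.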

Theorems & Definitions (3)

  • Lemma 1: Chebyshev's inequality
  • Theorem 1
  • proof