Reinforcement learning for quantum processes with memory

Josep Lumbreras, Ruo Cheng Huang, Yanglin Hu, Marco Fanizza, Mile Gu

Abstract

In reinforcement learning, an agent interacts sequentially with an environment to maximize a reward, receiving only partial, probabilistic feedback. This creates a fundamental exploration-exploitation trade-off: the agent must explore to learn the hidden dynamics while exploiting this knowledge to maximize its target objective. While extensively studied classically, applying this framework to quantum systems requires dealing with hidden quantum states that evolve via unknown dynamics. We formalize this problem via a framework where the environment maintains a hidden quantum memory evolving via unknown quantum channels, and the agent intervenes sequentially using quantum instruments. For this setting, we adapt an optimistic maximum-likelihood estimation algorithm. We extend the analysis to continuous action spaces, allowing us to model general positive operator-valued measures (POVMs). By controlling the propagation of estimation errors through quantum channels and instruments, we prove that the cumulative regret of our strategy scales as $\widetilde{\mathcal{O}}(\sqrt{K})$ over $K$ episodes. Furthermore, via a reduction to the multi-armed quantum bandit problem, we establish information-theoretic lower bounds demonstrating that this sublinear scaling is strictly optimal up to polylogarithmic factors. As a physical application, we consider state-agnostic work extraction. When extracting free energy from a sequence of non-i.i.d. quantum states correlated by a hidden memory, any lack of knowledge about the source leads to thermodynamic dissipation. In our setting, the mathematical regret exactly quantifies this cumulative dissipation. Using our adaptive algorithm, the agent uses past energy outcomes to improve its extraction protocol on the fly, achieving sublinear cumulative dissipation, and, consequently, an asymptotically zero dissipation rate.
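
To make the interaction model concrete, here is a minimal single-qubit sketch of one episode, using illustrative ingredients that are not taken from the paper: the hidden memory is assumed to be a qubit density matrix, the unknown CPTP channel is a fixed phase-damping map, and each classical action selects a two-outcome projective instrument in a rotated basis. The agent samples an outcome via the Born rule, registers a reward, and the post-measurement memory is carried into the next round, mirroring the loop described in the abstract and in Figure 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def dephasing_channel(rho, p=0.2):
    """Illustrative stand-in for an unknown CPTP channel E_l acting on the
    hidden memory: phase damping of strength p, written with two Kraus operators."""
    K0 = np.array([[1, 0], [0, np.sqrt(1 - p)]], dtype=complex)
    K1 = np.array([[0, 0], [0, np.sqrt(p)]], dtype=complex)
    return K0 @ rho @ K0.conj().T + K1 @ rho @ K1.conj().T

def instrument(angle):
    """Quantum instrument selected by the classical action a = angle:
    a two-outcome projective measurement in a rotated basis, returned as the
    CP maps {Phi_0^(a), Phi_1^(a)} via their single Kraus operators."""
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    vecs = [np.array([c, s], dtype=complex), np.array([-s, c], dtype=complex)]
    return [np.outer(v, v.conj()) for v in vecs]

def run_episode(actions, rho_init, reward=lambda o: float(o)):
    """One L-step episode: channel evolution, then the agent's instrument;
    the post-measurement memory state is carried into the next round."""
    rho, total = rho_init.copy(), 0.0
    history = []
    for a in actions:                                # classical actions a_1..a_L
        rho = dephasing_channel(rho)                 # hidden memory evolves via E_l
        kraus = instrument(a)
        probs = [np.real(np.trace(M @ rho @ M.conj().T)) for M in kraus]
        o = rng.choice(2, p=np.array(probs) / sum(probs))    # Born-rule outcome o_l
        rho = kraus[o] @ rho @ kraus[o].conj().T / probs[o]  # memory update
        history.append(o)
        total += reward(o)                           # agent registers r(o_l)
    return history, total

rho1 = np.array([[0.75, 0.25], [0.25, 0.25]], dtype=complex)  # initial memory rho_1
print(run_episode([0.0, np.pi / 4, np.pi / 2], rho1))
```

In the actual problem the channel and instruments are unknown to the agent; a learning algorithm such as OMLE would use many such episodes to build and refine a set of candidate models while maximizing the collected reward.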

Paper Structure

This paper contains 94 sections, 28 theorems, 323 equations, 5 figures, and 3 algorithms.

Key Result

Theorem 1.1

Let the action space $\mathcal{A}$ be discrete with cardinality $A$. Assuming that the true QHMM environment satisfies the undercomplete assumption with robustness $\kappa_{\mathrm{uc}}$, the OMLE algorithm achieves a cumulative regret of $\widetilde{\mathcal{O}}(\mathop{\mathrm{poly}}\nolimits(A,O,\dots)\,\sqrt{K})$ over $K$ episodes.
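
For orientation, recall the usual episodic-learning notion of regret (the notation below is assumed here, not quoted from the paper): the cumulative regret over $K$ episodes is

$$\mathrm{Reg}(K) \;=\; \sum_{k=1}^{K}\left(V^{\star}-V^{\pi_k}\right),$$

where $V^{\star}$ is the optimal expected cumulative reward of an $L$-step episode and $\pi_k$ is the policy played in episode $k$. A bound of order $\widetilde{\mathcal{O}}(\sqrt{K})$ therefore forces the average per-episode suboptimality $\mathrm{Reg}(K)/K$ to vanish as $K\to\infty$, which in the work-extraction application is precisely the statement that the dissipation rate tends to zero.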

Figures (5)

  • Figure 1: The Classical-Quantum RL Interface. Scheme of the reinforcement learning framework explored in this work. A classical agent (left) interacts with an unknown quantum process (center) via classical control bitstrings. The agent’s goal is to maximize cumulative rewards—represented here as extracted energy—while minimizing "waste". This waste can be viewed through two lenses: as regret in the context of learning theory, or as dissipation in the context of quantum thermodynamics.
  • Figure 2: Schematic representation of the $L$-step input-output QHMM environment. The environment's initial hidden memory state, $\rho_1$, evolves sequentially through unknown completely positive trace-preserving (CPTP) channels $\mathbb{E}_l$. At each round $l$, a classical agent utilizes the past trajectory to select an action $a_l$, which dictates the quantum instrument $\mathcal{P}^{(a_l)}$ applied to the latent memory. This physical interaction, governed by the completely positive map $\Phi_{o_l}^{(a_l)}$, yields a classical outcome $o_l$. As depicted in the thought bubbles, the agent internally registers a scalar reward $r(o_l)$ based on this outcome and updates its accumulated trajectory history to optimize future interventions. Simultaneously, the post-measurement memory state is updated for the subsequent channel evolution.
  • Figure 3: Schematic of the work extraction. Panel (a) illustrates an example of a classical hidden Markov model that dictates the dynamics of the process. $\{S_i\}_{i=1}^2$ are the latent classical states, and $\{\sigma_i\}_{i=1}^2$ are emitted depending on which transition occurs at each time step, governed by the probability $p$; these dynamics are encoded in the transition matrix $\mathbb{E}$. Panel (b) illustrates how an agent interacts sequentially with the emitted quantum states via CPTNI maps $\{\Phi_{o_i}^{(a_i)}\}_{i=1}^L$. The memory is a classical distribution over the latent states of panel (a) and evolves according to the dynamics encoded in $\{\mathbb{E}_l\}_{l=1}^L$. After each interaction, the agent receives an outcome $o_l$ in the form of a work value $w_l$ that is stored in the battery. This sequential emission model can be described by the input-output QHMM framework, as shown in Sec. \ref{sec:sequential_HMM}. A minimal numerical sketch of this setup appears after this figure list.
  • Figure 4: Plot showing cumulative work dissipation against the number of episodes. The black line corresponds to $L=3$, the green line to $L=4$, and the blue line to $L=5$; the 95% confidence interval of each is shown as a translucent band. The red line shows, for comparison, an agent using a random policy, i.e. choosing uniformly at random from the available actions at each time step.
  • Figure 5: A circuit diagram representation of the work extraction protocol. $Q$ denotes the system from which free energy is drawn, $B$ is a battery, and $R$ represents a thermal reservoir acting as an ancillary system. The protocol aims to transform $\rho_Q$ into a thermal state $\gamma_Q$ with the help of thermal states from the reservoir; the free energy lost by system $Q$ is balanced by the increase in energy of the battery $B$.
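
As a rough numerical companion to Figures 3 and 4, the sketch below simulates the two-state hidden Markov source together with the random-policy baseline (the red curve in Figure 4). All numbers are illustrative assumptions rather than the paper's parameters: the transition probability $p$, the emitted qubit states $\sigma_i$, the discrete action set of measurement bases, and the work values attached to the outcomes are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two emitted qubit states (placeholders for the sigma_i of Fig. 3).
sigma = [np.diag([0.9, 0.1]).astype(complex),
         np.diag([0.2, 0.8]).astype(complex)]

def povm(angle):
    """Two-outcome measurement in a rotated basis, selected by the action."""
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    vecs = [np.array([c, s], dtype=complex), np.array([-s, c], dtype=complex)]
    return [np.outer(v, v.conj()) for v in vecs]

ACTIONS = [0.0, np.pi / 3, 2 * np.pi / 3]   # discrete action set of basis angles
WORK = [1.0, -0.2]                           # w(o): illustrative work per outcome

def episode_random_policy(L=4, p=0.3):
    """One L-step episode of Fig. 3(a): the latent state flips with probability p,
    emits sigma[latent], and a random-policy agent measures each emission."""
    latent, total_work = 0, 0.0
    for _ in range(L):
        if rng.random() < p:                 # hidden transition S_1 <-> S_2
            latent ^= 1
        rho = sigma[latent]                  # emitted quantum state
        effects = povm(rng.choice(ACTIONS))  # random action (Fig. 4 baseline)
        probs = [np.real(np.trace(M @ rho)) for M in effects]
        o = rng.choice(2, p=np.array(probs) / sum(probs))
        total_work += WORK[o]                # outcome banked in the battery
    return total_work

K = 1000
print("average work per episode (random policy):",
      np.mean([episode_random_policy() for _ in range(K)]))
```

The other curves in Figure 4 correspond to an adaptive agent that uses past work outcomes to refine its choice of measurement, which is what keeps its cumulative dissipation sublinear.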

Theorems & Definitions (61)

  • Theorem 1.1: Informal: Regret for Discrete Actions
  • Theorem 1.2: Informal: Regret for Continuous Actions
  • Theorem 1.3: Informal: Information-Theoretic Lower Bounds
  • Definition 2.1: Input-output QHMM environment
  • Remark 2.2
  • Definition 3.2: $\kappa_{\mathrm{uc}}$-robust QHMM environment class
  • Definition 3.3: OOM superoperators on the classical register
  • Lemma 3.4
  • Proof
  • Remark 4.1
  • ...and 51 more