Artificially intelligent Maxwell's demon for optimal control of open quantum systems

Paolo Andrea Erdman, Robert Czupryniak, Bibek Bhandari, Andrew N. Jordan, Frank Noé, Jens Eisert, Giacomo Guarnieri

TL;DR

This work presents a reinforcement-learning framework in which an agent acts as a quantum Maxwell's demon to optimize feedback control in open quantum systems, balancing long-term cooling power against the thermodynamic cost of measurements. By modeling the system with Lindblad dynamics and POVM measurements, the approach explores regimes defined by the ordering of thermalization, measurement, and unitary timescales, and demonstrates that the RL agent discovers non-intuitive yet interpretable strategies that outperform intuitive benchmarks. Across one- and two-qubit setups, the study reveals regime-dependent policies: conditioned finite-time thermalization in the thermalization-dominated regime, adaptive and sometimes perpendicular measurements with repeated weak probes in the measurement-dominated regime, and modulated thermalization strokes when measurement and thermalization timescales are comparable. The results establish a principled pathway for AI-assisted design of quantum thermodynamic devices, with potential extensions to many-body systems and experimental implementations, and provide Pareto-front insights that inform practical trade-offs between cooling power and measurement cost.

Abstract

Feedback control of open quantum systems is of fundamental importance for practical applications in various contexts, ranging from quantum computation to quantum error correction and quantum metrology. Its use in the context of thermodynamics further enables the study of the interplay between information and energy. However, deriving optimal feedback control strategies is highly challenging, as it involves the optimal control of open quantum systems, the stochastic nature of quantum measurement, and the inclusion of policies that maximize a long-term time- and trajectory-averaged goal. In this work, we employ a reinforcement learning approach to automate and capture the role of a quantum Maxwell's demon: the agent takes the literal role of discovering optimal feedback control strategies in qubit-based systems that maximize a trade-off between measurement-powered cooling and measurement efficiency. Considering weak or projective quantum measurements, we explore different regimes based on the ordering between the thermalization, the measurement, and the unitary feedback timescales, finding different and highly non-intuitive, yet interpretable, strategies. In the thermalization-dominated regime, we find strategies with elaborate finite-time thermalization protocols conditioned on measurement outcomes. In the measurement-dominated regime, we find that optimal strategies involve adaptively measuring different qubit observables reflecting the acquired information, and repeating multiple weak measurements until the quantum state is "sufficiently pure", leading to random walks in state space. Finally, we study the case when all timescales are comparable, finding new feedback control strategies that considerably outperform more intuitive ones. We discuss a two-qubit example where we explore the role of entanglement, and conclude by discussing the scaling of our results to quantum many-body systems.
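The "repeat weak measurements until the state is sufficiently pure" idea from the measurement-dominated regime can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the paper's model: the two-outcome Kraus parametrization by a hypothetical strength `k` (with `k = 0.5` giving no information and `k = 1` a projective measurement), the purity threshold, and the helper names. Thermalization and unitary feedback are left out, so only the measurement-induced random walk of the state is shown.

```python
import numpy as np


def weak_z_kraus(k):
    """Assumed Kraus pair for a weak sigma_z measurement (hypothetical parametrization).

    k = 0.5 yields no information; k = 1.0 is a projective measurement.
    """
    m0 = np.diag([np.sqrt(k), np.sqrt(1.0 - k)])
    m1 = np.diag([np.sqrt(1.0 - k), np.sqrt(k)])
    return m0, m1


def measure(rho, kraus, rng):
    """Sample one outcome and return (outcome, normalized post-measurement state)."""
    probs = np.array([np.real(np.trace(m @ rho @ m.conj().T)) for m in kraus])
    probs = probs / probs.sum()  # guard against rounding error
    outcome = rng.choice(len(kraus), p=probs)
    m = kraus[outcome]
    post = m @ rho @ m.conj().T
    return outcome, post / np.real(np.trace(post))


def purity(rho):
    """Purity Tr(rho^2): 0.5 for a maximally mixed qubit, 1.0 for a pure state."""
    return float(np.real(np.trace(rho @ rho)))


def measure_until_pure(rho, k, threshold, rng, max_steps=1000):
    """Repeat weak measurements until the purity reaches `threshold` or steps run out."""
    kraus = weak_z_kraus(k)
    purities = [purity(rho)]
    for _ in range(max_steps):
        if purities[-1] >= threshold:
            break
        _, rho = measure(rho, kraus, rng)
        purities.append(purity(rho))
    return rho, purities


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mixed = np.eye(2) / 2.0  # maximally mixed starting state (illustrative choice)
    _, history = measure_until_pure(mixed, k=0.65, threshold=0.95, rng=rng)
    print(len(history) - 1, "weak measurements; final purity", history[-1])
```

In this sketch the purity does not increase monotonically: conflicting outcomes move the state back toward the mixed state, which is the random-walk behaviour the abstract refers to. The number of measurements needed therefore varies from trajectory to trajectory.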

Paper Structure

This paper contains 21 sections, 57 equations, 21 figures, and 2 tables.

Figures (21)

  • Figure 1: Schematic representation of a quantum Maxwell's demon. Based on previous measurement outcomes, the demon can decide whether to (partially) thermalize the quantum system, perform further measurements, or perform unitary feedback. The goal is to optimize the trade-off between cooling power and the cost of measuring the system. In this manuscript, we treat a reinforcement learning agent as an actual Maxwell's demon.
  • Figure 2: Schematic representation of the RL method learning to act as an optimal quantum Maxwell's demon. At every small time-step $t_i=i\Delta t$, the RL agent (blue box) interacts with an open quantum system (green box) by performing actions $a_i$ (lower orange box) consisting of a discrete choice (whether to thermalize, measure, or evolve unitarily) and a continuous action (representing some time-dependent control). After the quantum state has evolved for a time-step $\Delta t$, the agent receives as feedback the state of the environment $s_{i+1}$, given by the density matrix of the quantum system conditioned on the measurement outcome, and a reward $r_{i+1}$ representing the optimization goal (upper orange boxes). The computer agent must learn an optimal policy $\pi^*(a|s)$ that maximizes the long-term, trajectory-averaged trade-off between cooling power and the cost of measurement. Through trial and error, the agent gradually improves its policy until convergence.
  • Figure 3: Maxwell's demon performance in the thermalization-dominated regime. Pareto front between the long-term average of the cooling power $\langle P \rangle$ and of the negative measurement dissipation $-\langle D \rangle$ (a), and between the cooling power $\langle P \rangle$ and the measurement efficiency $\eta$ (b). Each black dot corresponds to a separate RL optimization for a different value of $c$, whereas the red line corresponds to the Pareto front of the interpretable policy described in Sec. \ref{sec:tau_t}. Example of actions chosen by the agent along an arbitrary trajectory, as a function of time, in the $c=0.8$ case (c) and $c=0.65$ case (d). The corresponding points on the Pareto front are shown in (a,b) as empty black circles. The color corresponds to the discrete action (see legend), and the value shown on the $y$-axis corresponds to the value of $u(t)$ during the thermalization step. Measurements are shown as vertical lines. Parameters: $E_0=5\,\beta^{-1}$, $\Gamma=1\,(\beta\hbar)^{-1}$, $u(t)\in [-0.8, 0.8]$, and $\Delta t = \{0.003, 0.02, 0.03, 0.07, 0.1, 0.2, 0.3\} \,\Gamma^{-1}$ for $c=\{0.95, 0.9, 0.8, 0.7, 0.65, 0.6, 0.58\}$, respectively, since the thermalization time increases as $c$ decreases.
  • Figure 4: Actions chosen by the RL agent as a function of the qubit's state, represented as a point on the Bloch sphere, in the measurement-dominated regime with discrete fixed measurements. The type of action is represented by the color of a point, and the black cross represents the thermal state. We allow the RL agent to cool the thermal bath either by $\sigma_{\rm x}$ measurement (left column) or $\sigma_{\rm z}$ measurement (right column). $\rho_{\rm x}$ and $\rho_{\rm z}$ denote the qubit's $x$ and $z$ coordinates on the Bloch sphere. Each row represents a different measurement strength, respectively $\kappa=\{0.99, 0.65, 0.55\}$. Each plot represents a trajectory using the RL policy for 10,000 time steps with parameters $\Gamma=0.8 \,(\beta\hbar)^{-1}$, $\Delta t=0.8\,\Gamma^{-1}$, and $E_0=0.5\,\beta^{-1}$.
  • Figure 5: Plots of the state component $\rho_{\rm x}$, as a function of time, corresponding to the results shown in Fig. \ref{fig:Bloch_discr_meas}(a,c,e) for the $\sigma_{\rm x}$ measurement case. An arbitrary trajectory is shown. The colors of the dots indicate the actions chosen by the RL agent at a given moment.
  • ...and 16 more figures
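
The Pareto fronts mentioned in the Figure 3 caption can be extracted from a set of optimization results with a simple dominance filter. The sketch below is a generic illustration, not the authors' code: the function name `pareto_front` is hypothetical, and the sample points are made-up numbers standing in for pairs $(\langle P \rangle, -\langle D \rangle)$, both of which are to be maximized.

```python
import numpy as np


def pareto_front(points):
    """Return the non-dominated points, sorted by the first objective.

    Both objectives are maximized. A point is dominated if another point is
    at least as good in every objective and strictly better in at least one.
    """
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        at_least_as_good = np.all(pts >= p, axis=1)
        strictly_better = np.any(pts > p, axis=1)
        if not np.any(at_least_as_good & strictly_better):
            keep.append(i)
    front = pts[keep]
    return front[np.argsort(front[:, 0])]


if __name__ == "__main__":
    # Illustrative placeholder values only, not results from the paper.
    samples = [(1.0, 5.0), (2.0, 4.0), (3.0, 1.0), (2.0, 2.0), (1.0, 1.0)]
    print(pareto_front(samples))
```

With the placeholder data above, the points (2, 2) and (1, 1) are dominated and dropped, leaving a front along which improving one objective necessarily worsens the other.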