Reinforcement Learning to Disentangle Multiqubit Quantum States from Partial Observations

Pavel Tashev; Stefan Petrov; Friederike Metz; Marin Bukov

TL;DR

The paper tackles the challenge of disentangling arbitrary multiqubit states using only partial information by casting the problem as a reinforcement-learning task. It employs a permutation-equivariant transformer policy to select which pair of qubits to couple with a two-qubit gate, and computes locally optimal gates from two-qubit reduced density matrices, enabling state-dependent, short disentangling circuits. Across 4–6 qubits and Haar-random initial states, the RL agent outperforms baseline random/greedy strategies, achieving substantial reductions in gate counts and CNOT complexity after transpilation, with demonstrated resilience to shot and hardware noise. These results suggest practical pathways for state preparation and circuit synthesis on NISQ devices, including a general 4-qubit circuit using at most five 2-qubit gates (ten CNOTs) to disentangle any 4-qubit state, with potential extensions to larger systems and tensor-network-inspired architectures.

Abstract

Using partial knowledge of a quantum state to control multiqubit entanglement is a largely unexplored paradigm in the emerging field of quantum interactive dynamics with the potential to address outstanding challenges in quantum state preparation and compression, quantum control, and quantum complexity. We present a deep reinforcement learning (RL) approach to constructing short disentangling circuits for arbitrary 4-, 5-, and 6-qubit states using an actor-critic algorithm. With access to only two-qubit reduced density matrices, our agent decides which pairs of qubits to apply two-qubit gates on; requiring only local information makes it directly applicable on modern NISQ devices. Utilizing a permutation-equivariant transformer architecture, the agent can autonomously identify qubit permutations within the state and adjust the disentangling protocol accordingly. Once trained, it provides circuits from different initial states without further optimization. We demonstrate the agent's ability to identify and exploit the entanglement structure of multiqubit states. For 4-, 5-, and 6-qubit Haar-random states, the agent learns to construct disentangling circuits that exhibit strong correlations both between consecutive gates and among the qubits involved. Through extensive benchmarking, we show the efficacy of the RL approach to find disentangling protocols with minimal gate resources. We explore the resilience of our trained agents to noise, highlighting their potential for real-world quantum computing applications. Analyzing optimal disentangling protocols, we report a general circuit to prepare an arbitrary 4-qubit state using at most 5 two-qubit (10 CNOT) gates.
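The partial observation and a single disentangling step described above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, the observation is taken over unordered pairs only, and the "locally optimal" gate is assumed to be the unitary that diagonalizes the two-qubit reduced density matrix with eigenvalues sorted in descending order.

```python
import numpy as np

def haar_random_state(num_qubits, rng):
    """Sample a Haar-random pure state as a normalized complex Gaussian vector."""
    dim = 2 ** num_qubits
    psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return psi / np.linalg.norm(psi)

def two_qubit_rdm(psi, i, j, num_qubits):
    """Reduced density matrix of qubits (i, j), with qubit i as the first factor."""
    tensor = psi.reshape([2] * num_qubits)
    # Bring qubits i and j to the front and flatten all other qubits into one axis.
    tensor = np.moveaxis(tensor, [i, j], [0, 1]).reshape(4, -1)
    return tensor @ tensor.conj().T

def observation(psi, num_qubits):
    """All two-qubit reduced density matrices, keyed by the pair (i, j), i < j."""
    return {(i, j): two_qubit_rdm(psi, i, j, num_qubits)
            for i in range(num_qubits) for j in range(i + 1, num_qubits)}

def locally_optimal_gate(rho_ij):
    """Assumed gate: rows are eigenvectors of rho_ij, largest eigenvalue first."""
    eigvals, eigvecs = np.linalg.eigh(rho_ij)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order].conj().T

def apply_two_qubit_gate(psi, gate, i, j, num_qubits):
    """Apply a 4x4 unitary to qubits (i, j) of the state vector psi."""
    tensor = np.moveaxis(psi.reshape([2] * num_qubits), [i, j], [0, 1])
    shape = tensor.shape
    tensor = (gate @ tensor.reshape(4, -1)).reshape(shape)
    # Undo the axis permutation before flattening back to a vector.
    return np.moveaxis(tensor, [0, 1], [i, j]).reshape(-1)

rng = np.random.default_rng(0)
psi = haar_random_state(4, rng)
obs = observation(psi, 4)                    # six matrices for four qubits
gate = locally_optimal_gate(obs[(1, 3)])     # the pair (1, 3) is an arbitrary choice
new_psi = apply_two_qubit_gate(psi, gate, 1, 3, 4)
```

In this sketch the pair is fixed by hand; in the paper, choosing the pair is exactly the decision delegated to the trained policy.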

Paper Structure (47 sections, 54 equations, 22 figures, 1 table, 2 algorithms)

Introduction
Multiqubit Disentangling Problem
Randomly and greedily placed locally optimal gates
Reinforcement Learning to Disentangle Quantum States
Reinforcement learning framework
Actor critic algorithm
Analyzing the behavior of disentangling RL agents
Benchmarking the RL agent on entangled 4-qubit states
Disentangling Haar-random 5-qubit states
Statistical properties of trained RL agents for 4-, 5-, and 6-qubit Haar-random states
RL-informed circuit transpilation
Application on noisy NISQ devices
Sampling noise
Depolarizing noise channel
Hardware noise model
...and 32 more sections

Figures (22)

Figure 1: Schematic representation of the pipeline used in this work. (a) Given an arbitrary entangled state, we design a noise-resilient RL agent to construct a protocol that disentangles it in a minimum number of steps. Blue shaded rectangles indicate a single step of the RL feedback loop which produces a two-qubit unitary gate to apply onto the state (black dots indicate the two qubits acted on). Light grey circles and dashed arcs sketch pictorially the process of reducing entanglement in the multi-qubit state. (b) Reinforcement learning environment: at each step $t$, the agent is given (i) a partial observation $o_t$, consisting of all two-qubit reduced density matrices of the state, and (ii) a reward signal $r_{t+1}$, used to train the agent to find optimal disentangling gates. (c) The RL agent consists of the so-called policy $\pi(a_t|o_t)$, a model for a probability distribution over the action space (here the qubit pairs $(i, j)$). At each time step the agent selects the qubit pair $(i, j)$ that maximizes the policy and applies a quantum gate $U^{(i,j)}$ to it; the gate itself is determined analytically. To do this, the observation $o_t$ is fed into a permutation-equivariant transformer neural network, which is trained using the rewards $r_t$ to approximate the optimal disentangling policy. This procedure is repeated in a feedback loop until the state is disentangled.
Figure 2: Exponential difficulty of the multiqubit disentangling problem starting from an $L$-qubit Haar-random state. (a) We place locally-optimal two-qubit gates on randomly chosen pairs of qubits (so-called random agent in Sec. \ref{sec:RL-agent}), and monitor the average entanglement $S_\text{avg}$ after every gate for quantum states of different sizes. The number of applied gates increases exponentially with the system size. While $S_\text{avg}$ decays exponentially with the number of gates $M$, the corresponding decay timescale $c(L)$ to reach a threshold value of $10^{-3}$ diverges exponentially in the system size $L$ [inset]. (b) Same as in (a) but for a locally greedy protocol (so-called greedy agent in Sec. \ref{sec:RL-agent}): at each step, we compute the entanglement after acting on all pairs of qubits and postselect the pair which leads to the smallest value of $S_\text{avg}$. For each system size, the curves are averaged over $2048$ different states; the shaded area shows the corresponding standard deviation.
Figure 3: (a) Interaction loop between the RL agent and the environment. At each time step the agent makes an observation $o_t$ containing partial information about the current state of the environment $s_t$; using that observation it selects an action $a_t$. After the action is applied the environment emits a reward signal $r_{t+1}$ and transitions into a new state $s_{t+1}$, which is subsequently used by the agent to select the next action $a_{t+1}$. (b) Action selection: an observation of the environment state $o_t$ comprises a measurement of all symmetrized two-qubit reduced density matrices (see text). The observation is then fed to both the actor (policy network) and the critic (value network). The actor learns a probability distribution $\pi(a_t|o_t)$ over the action space, called policy, and the action is sampled from that distribution. The critic learns a scalar number that evaluates the actions of the actor and is used for improving its performance during training.
Figure 4: Average episode length (main figure; plotted every 100th iteration) and average accuracy (insets; plotted every iteration), defined as the percentage of disentangled states, achieved by the RL agent during training on 4-, 5-, and 6-qubit systems. Training proceeds in two separate stages: (i) accuracy improvement dominates learning at early iterations [inset] until the agent's accuracy reaches nearly 100%; (ii) an efficiency improvement stage follows, during which the agent reduces the average number of gates in the circuit needed to disentangle a state. Agents trained on larger quantum systems require more iterations to converge. The trained agents are able to construct more compressed disentangling circuits than previously known algorithms (see Sec. \ref{sec:cnot_count} for a detailed comparison). All hyperparameters used for training are listed in Table \ref{table:hyperparams}.
Figure 5: Benchmarking the trained RL agent on a set of four-qubit states starting from (a) a pair of Bell states $|\psi_{1,2,3,4}\rangle {=} |\text{Bell}_{1,2}\rangle |\text{Bell}_{3,4}\rangle$ featuring bipartite entanglement, (b) a tripartite entangled GHZ state $|\psi_{1,2,3,4}\rangle {=} |0_1\rangle|\text{GHZ}_{2,3,4}\rangle$, and (c) a product of a Haar-random tripartite entangled state and a Haar-random single-qubit state, $|\psi_{1,2,3,4}\rangle {=} |\text{R}_1\rangle|\text{R}_{2,3,4}\rangle$. The left-hand side shows the circuit diagram using locally optimal two-qubit gates (indicated by black lines connecting filled black circles). The right-hand side shows in the form of a histogram the RL agent's policy before applying each gate (probabilities rounded to percent). We display separately the most probable actions (i.e., qubit pairs). The 'rest' column denotes the aggregated probability for all remaining actions; the selected action at each step is shown in blue. The RL agent correctly identifies the spatial structure of entanglement even for random states where the latter is not obvious, and disentangles the initial state using as few gates as possible.
...and 17 more figures
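
The random and greedy baselines described in the caption of Figure 2 can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, $S_\text{avg}$ is assumed to be the average single-qubit von Neumann entropy, the locally optimal gate is assumed to diagonalize the two-qubit reduced density matrix with eigenvalues in descending order, and `max_gates` is an arbitrary safety cap that does not appear in the source.

```python
import numpy as np

def rdm(psi, qubits, n):
    """Reduced density matrix of the listed qubits of an n-qubit pure state."""
    k = len(qubits)
    t = np.moveaxis(psi.reshape([2] * n), qubits, list(range(k)))
    t = t.reshape(2 ** k, -1)
    return t @ t.conj().T

def s_avg(psi, n):
    """Assumed definition: single-qubit von Neumann entropy averaged over qubits."""
    total = 0.0
    for q in range(n):
        p = np.linalg.eigvalsh(rdm(psi, [q], n))
        p = p[p > 1e-12]  # drop numerically zero eigenvalues before taking logs
        total += -np.sum(p * np.log(p))
    return total / n

def apply_locally_optimal_gate(psi, i, j, n):
    """Diagonalize the (i, j) reduced density matrix (assumed optimal gate)."""
    _, vecs = np.linalg.eigh(rdm(psi, [i, j], n))
    gate = vecs[:, ::-1].conj().T  # reverse columns: descending eigenvalues
    t = np.moveaxis(psi.reshape([2] * n), [i, j], [0, 1])
    t = (gate @ t.reshape(4, -1)).reshape(t.shape)
    return np.moveaxis(t, [0, 1], [i, j]).reshape(-1)

def run_baseline(psi, n, rng, greedy, threshold=1e-3, max_gates=500):
    """Apply gates until S_avg drops below threshold; return S_avg after each gate."""
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    history = [s_avg(psi, n)]
    while history[-1] > threshold and len(history) <= max_gates:
        if greedy:
            # Try every pair and keep the outcome with the smallest S_avg.
            candidates = [apply_locally_optimal_gate(psi, i, j, n) for i, j in pairs]
            psi = min(candidates, key=lambda c: s_avg(c, n))
        else:
            i, j = pairs[rng.integers(len(pairs))]
            psi = apply_locally_optimal_gate(psi, i, j, n)
        history.append(s_avg(psi, n))
    return history

rng = np.random.default_rng(0)
n = 4
psi0 = rng.normal(size=2 ** n) + 1j * rng.normal(size=2 ** n)
psi0 /= np.linalg.norm(psi0)  # Haar-random initial state
greedy_history = run_baseline(psi0, n, rng, greedy=True)
random_history = run_baseline(psi0, n, rng, greedy=False)
```

Comparing `len(greedy_history) - 1` and `len(random_history) - 1` gives the gate counts of the two baselines for one initial state; the figure averages such curves over 2048 states per system size.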
