Challenges for Reinforcement Learning in Quantum Circuit Design

Philipp Altmann; Jonas Stein; Michael Kölle; Adelina Bärligea; Thomas Gabor; Thomy Phan; Sebastian Feld; Claudia Linnhoff-Popien

TL;DR

The paper addresses the automated design of quantum circuits for NISQ hardware using reinforcement learning (RL), introducing qcd-gym to learn policies over a universal gate set under hardware constraints. Key objective measures are the state-preparation fidelity $F$, the squared overlap between the prepared state and the target state $\ket{\Phi}$, and the unitary-composition distance $D(\boldsymbol{U}, \boldsymbol{V}(\boldsymbol{\nabla})) = \lVert \boldsymbol{U} - \boldsymbol{V}(\boldsymbol{\nabla}) \rVert^2$, with a bounded similarity reward $R^{UC} = 1 - \tan^{-1}(D(\boldsymbol{U}, \boldsymbol{V}(\boldsymbol{\nabla})))$. Contributions: formulating the state preparation (SP) and unitary composition (UC) objectives, defining qcd-gym as a Markov decision process (MDP), and benchmarking model-free RL algorithms (A2C, PPO, TD3, SAC) against a GA and a random baseline, which reveals both the potential and the current challenges of RL in quantum circuit design (QCD). The findings show that RL can learn nontrivial circuits and that SAC often performs best, but robust, hardware-aware scaling requires improved reward shaping, handling of partial observability, and integration with hardware constraints.
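The measures named above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the norm is assumed to be the Frobenius norm, which the summary does not specify.

```python
import numpy as np


def fidelity(psi, phi):
    """Squared overlap |<phi|psi>|^2 between two normalized pure states."""
    psi = np.asarray(psi, dtype=complex)
    phi = np.asarray(phi, dtype=complex)
    # np.vdot conjugates its first argument, giving the inner product <phi|psi>.
    return float(abs(np.vdot(phi, psi)) ** 2)


def distance(U, V):
    """Squared norm of U - V (assumption: Frobenius norm)."""
    diff = np.asarray(U, dtype=complex) - np.asarray(V, dtype=complex)
    # For a 2-D array, np.linalg.norm defaults to the Frobenius norm.
    return float(np.linalg.norm(diff) ** 2)


def uc_reward(U, V):
    """Bounded similarity reward 1 - arctan(D(U, V))."""
    return 1.0 - float(np.arctan(distance(U, V)))


# Illustrative inputs (hypothetical, chosen only for this sketch).
hadamard = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
bell = np.array([1, 0, 0, 1]) / np.sqrt(2)
zero = np.array([1, 0, 0, 0])

f_bell = fidelity(zero, bell)              # overlap of |00> with the Bell state: 0.5
r_exact = uc_reward(hadamard, hadamard)    # identical unitaries: distance 0, reward 1.0
r_empty = uc_reward(hadamard, np.eye(2))   # identity circuit: strictly below 1
```

Because the arctangent maps $[0, \infty)$ to $[0, \pi/2)$, a reward of this form stays within $(1 - \pi/2, 1]$ and reaches 1 only when the two unitaries coincide.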

Abstract

Quantum computing (QC) in the current NISQ era is still limited in size and precision. Hybrid applications mitigating those shortcomings are prevalent to gain early insight and advantages. Hybrid quantum machine learning (QML) comprises both the application of QC to improve machine learning (ML) and ML to improve QC architectures. This work considers the latter, leveraging reinforcement learning (RL) to improve quantum circuit design (QCD), which we formalize by a set of generic objectives. Furthermore, we propose qcd-gym, a concrete framework formalized as a Markov decision process, to enable learning policies capable of controlling a universal set of continuously parameterized quantum gates. Finally, we provide benchmark comparisons to assess the shortcomings and strengths of current state-of-the-art RL algorithms.
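To make the Markov decision process framing more concrete, here is a toy environment sketch. It is not the qcd-gym API: the class name `ToyCircuitEnv`, the action encoding `(op, qubit, theta)`, the restriction to RY and CNOT gates, and the per-step fidelity reward are all assumptions chosen for illustration.

```python
import numpy as np


def ry(theta):
    """Single-qubit rotation about the Y axis by angle theta."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)


class ToyCircuitEnv:
    """Hypothetical state-preparation MDP (assumes n_qubits >= 2)."""

    def __init__(self, target, n_qubits=2, max_depth=4):
        self.target = np.asarray(target, dtype=complex)
        self.n = n_qubits
        self.max_depth = max_depth
        self.reset()

    def reset(self):
        # Start every episode in |0...0>.
        self.state = np.zeros(2 ** self.n, dtype=complex)
        self.state[0] = 1.0
        self.t = 0
        return self.state.copy()

    def _apply_1q(self, gate, q):
        # Contract the gate with axis q of the state tensor, then restore axis order.
        psi = self.state.reshape([2] * self.n)
        psi = np.tensordot(gate, psi, axes=([1], [q]))
        self.state = np.moveaxis(psi, 0, q).reshape(-1)

    def _apply_cnot(self, control, target):
        # Flip the target axis inside the control = 1 subspace.
        psi = self.state.reshape([2] * self.n).copy()
        idx = [slice(None)] * self.n
        idx[control] = 1
        axis = target if target < control else target - 1
        psi[tuple(idx)] = np.flip(psi[tuple(idx)], axis=axis).copy()
        self.state = psi.reshape(-1)

    def step(self, action):
        op, qubit, theta = action
        if op == 0:
            self._apply_1q(ry(theta), qubit)
        else:
            self._apply_cnot(qubit, (qubit + 1) % self.n)
        self.t += 1
        # Assumed reward: fidelity with the target after every step.
        reward = float(abs(np.vdot(self.target, self.state)) ** 2)
        done = self.t >= self.max_depth
        return self.state.copy(), reward, done


# Usage: prepare a Bell state with RY(pi/2) on qubit 0 followed by a CNOT.
bell = np.array([1, 0, 0, 1]) / np.sqrt(2)
env = ToyCircuitEnv(bell)
env.step((0, 0, np.pi / 2))
_, reward, done = env.step((1, 0, 0.0))
```

In this sketch the agent observes the full state vector, which a simulator can provide but real hardware cannot; the observation and reward design of the actual framework may differ.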

Paper Structure (22 sections, 19 equations, 4 figures)

Introduction
Background
  Quantum Computing
  Reinforcement Learning
Quantum Circuit Design Objectives for Reinforcement Learning
  State Preparation (SP)
  Unitary Composition (UC)
Quantum Circuit Designer
  State
  Action
  Reward
Related Work
  Quantum Architecture Search
  Quantum Control
  Quantum Circuit Optimization
...and 7 more sections

Figures (4)

Figure 1: qcd-gym for $\eta$ qubits with depth $\delta$, generating a sequence $\Sigma$ of continuously parameterized operations $a$ (blue) to optimally resemble a target state $\ket{\Phi}$ or unitary $\boldsymbol{U}$ in a single optimization loop, based on the observed state $s$ (red) and the reward $r_t$ (green).
Figure 2: Quantum Circuit Design Evaluation: Benchmarking A2C (orange), PPO (blue), SAC (green), and TD3 (red) for Hadamard Composition and GHZ State Preparation with regard to the Mean Metric (Fidelity and Similarity, higher is better), Mean Qubits utilized, and Mean Depth of the resulting circuit, against a GA (gray) and a Random baseline (dashed line). Shaded areas mark the 95% confidence intervals. Overall, SAC shows the highest objective performance with the highest qubit and depth utilization (which could be further improved towards the optimal 1/3 operation utilization).
Figure 3: Quantum Circuit Design Evaluation: Benchmarking A2C (orange), PPO (blue), SAC (green), and TD3 (red) for Random State Preparation and Toffoli Composition with regard to the Mean Metric (Fidelity and Similarity, higher is better), Mean Qubits utilized, and Mean Depth of the resulting circuit, against a GA (gray) and a Random baseline (dashed line). Shaded areas mark the 95% confidence intervals. While random states are predominantly prepared well, utilizing all qubits and reasonable depths, the Toffoli Composition exhibits a local similarity optimum of 0.3 for empty circuits.
Figure 4: Additional Benchmark Results: Benchmarking A2C (orange), PPO (blue), SAC (green), and TD3 (red) for Bell State Preparation and Random Composition with regard to the Mean Metric (Fidelity and Similarity, higher is better), Mean Qubits utilized, and Mean Depth of the resulting circuit, against a GA (gray) and a Random baseline (dashed line). Shaded areas mark the 95% confidence intervals. While the Bell state is almost optimally prepared using SAC, composing random unitary operations poses serious challenges that hinder optimal convergence.
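
The captions above report means with shaded 95% confidence intervals. How the intervals were computed is not stated here; the sketch below shows one common choice, a normal approximation over independent runs, with a hypothetical function name and input layout.

```python
import numpy as np


def mean_with_ci(metric, z=1.96):
    """Mean and approximate 95% confidence band over runs.

    metric: array of shape (n_runs, n_steps), one row per training run
            (assumed layout; requires at least two runs).
    z:      normal quantile, 1.96 for a two-sided 95% interval (assumption).
    """
    metric = np.asarray(metric, dtype=float)
    mean = metric.mean(axis=0)
    # Standard error of the mean, using the sample standard deviation.
    sem = metric.std(axis=0, ddof=1) / np.sqrt(metric.shape[0])
    return mean, mean - z * sem, mean + z * sem


# Two hypothetical runs evaluated at two time steps.
mean, lower, upper = mean_with_ci([[0.0, 1.0], [2.0, 1.0]])
```

With only a handful of runs, a Student-t quantile or a bootstrap would give wider and arguably more honest bands than the fixed 1.96 used in this sketch.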
