Hybrid Reward-Driven Reinforcement Learning for Efficient Quantum Circuit Synthesis

Sara Giordano, Kornikar Sen, Miguel A. Martin-Delgado

TL;DR

This work tackles the challenge of efficiently synthesizing quantum circuits that prepare a target state from a fixed initial state by employing a tabular Q-learning framework on a discretized SWEET state space. It introduces a circuit-aware, hybrid reward design that combines offline static rewards with online dynamic penalties to drive minimum-depth and low-gate-count circuits, demonstrated on graph-state benchmarks up to seven qubits. The results show depth-optimal graph-state circuits matching theoretical bounds, and high-fidelity approximations when using a universal gate set, all while leveraging sparse, database-like storage to manage the large state-action space. The approach offers a resource-efficient foundation for quantum circuit optimization in the NISQ and fault-tolerant eras and provides a path toward extending to unitary synthesis and deep RL with parameterized gates.

Abstract

A reinforcement learning (RL) framework is introduced for the efficient synthesis of quantum circuits that generate specified target quantum states from a fixed initial state, addressing a central challenge in both the Noisy Intermediate-Scale Quantum (NISQ) era and future fault-tolerant quantum computing. The approach utilizes tabular Q-learning, based on action sequences, within a discretized quantum state space, to effectively manage the exponential growth of the space dimension. The framework introduces a hybrid reward mechanism, combining a static, domain-informed reward that guides the agent toward the target state with customizable dynamic penalties that discourage inefficient circuit structures such as gate congestion and redundant state revisits. This is a circuit-aware reward, in contrast to the current trend of works on this topic, which are primarily fidelity-based. By leveraging sparse matrix representations and state-space discretization, the method enables practical navigation of high-dimensional environments while minimizing computational overhead. Benchmarking on graph-state preparation tasks for up to seven qubits, we demonstrate that the algorithm consistently discovers minimal-depth circuits with optimized gate counts. Moreover, extending the framework to a universal gate set still yields low-depth circuits, highlighting the algorithm's robustness and adaptability. The results confirm that this RL-driven approach, with our completely circuit-aware method, efficiently explores the complex quantum state space and synthesizes near-optimal quantum circuits, providing a resource-efficient foundation for quantum circuit optimization.
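
The interplay of a static reward, a dynamic penalty, and sparse Q-storage described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function name `synthesize`, the two-stratum reward based on the number of differing edges, the revisit-only penalty, and all hyperparameter values are assumptions chosen for illustration. The toy relies on the fact that CZ gates commute and are self-inverse, so a graph state built from $|+\rangle^{\otimes n}$ can be tracked as a set of edges that each CZ toggles.

```python
import random
from itertools import combinations


def synthesize(n_qubits, target_edges, episodes=2000, max_len=8,
               alpha=0.5, gamma=0.9, eps=0.2, r_max=10.0,
               revisit_penalty=1.0, seed=0):
    """Toy tabular Q-learning for graph-state preparation with CZ gates.

    Edges must be tuples (i, j) with i < j. All defaults are illustrative.
    """
    rng = random.Random(seed)
    actions = list(combinations(range(n_qubits), 2))  # one CZ per qubit pair
    target = frozenset(target_edges)
    q = {}  # sparse Q-table: only visited (state, action) pairs are stored

    def static_reward(state):
        # Assumed two-stratum static reward: full reward on the target,
        # a small reward one edge away, zero elsewhere.
        dist = len(state ^ target)
        if dist == 0:
            return r_max
        if dist == 1:
            return 0.1 * r_max
        return 0.0

    def best_action(state):
        return max(actions, key=lambda a: q.get((state, a), 0.0))

    # Training phase: epsilon-greedy exploration.
    for _ in range(episodes):
        state, visited = frozenset(), {frozenset()}
        for _ in range(max_len):
            explore = rng.random() < eps
            a = rng.choice(actions) if explore else best_action(state)
            nxt = state ^ {a}  # applying CZ toggles the edge
            # Dynamic penalty (assumed form): discourage revisiting a state.
            penalty = revisit_penalty if nxt in visited else 0.0
            reward = static_reward(nxt) - penalty
            done = nxt == target
            future = 0.0 if done else max(q.get((nxt, b), 0.0) for b in actions)
            old = q.get((state, a), 0.0)
            q[(state, a)] = old + alpha * (reward + gamma * future - old)
            visited.add(nxt)
            state = nxt
            if done:
                break

    # Testing phase: Q is frozen and the circuit is built greedily.
    state, circuit = frozenset(), []
    while state != target and len(circuit) < max_len:
        a = best_action(state)
        circuit.append(a)
        state = state ^ {a}
    return circuit if state == target else None
```

Calling `synthesize(3, [(0, 1), (1, 2)])` trains on a three-qubit path graph and returns the greedy CZ sequence, or `None` if the target is not reached within `max_len` gates. The intermediate stratum is kept small relative to `r_max` so that looping through near-target states never outscores finishing the circuit.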

Paper Structure

This paper contains 26 sections, 18 equations, 7 figures, 3 algorithms.

Figures (7)

  • Figure 1: A pictorial representation of the overall pipeline described in Section \ref{sec:3}. On the left, the green block is the offline reward assignment: given the target state, the number of strata, and the maximum value of the reward, we build an offline static reward map; this step runs once. At the center, the yellow block is the training phase, where an $\varepsilon$-greedy agent interacts with the environment; at each step it selects an action $a_t$, observes the next state $s_{t+1}$, and receives the total reward, which is fed back from the environment as a combination of the offline static reward and online penalties. Training continues until the prescribed number of episodes is reached. On the right, the blue block is the testing phase: with the learned $Q$ frozen, we construct the final circuit using a completely greedy strategy. If the maximum allowed circuit length is reached before the target state, we return to the training phase. Otherwise, once the target is reached, we output the final optimized circuit. Arrows indicate the data flow between stages and the inputs and outputs of the three blocks.
  • Figure 2: The figure illustrates the structure of two specific graphs alongside diagrams of the circuits used to generate them. The top panel displays a $4$-vertex graph (left) and a $7$-vertex graph (right). Edges of the same color signify that the corresponding CZ gates can be applied simultaneously at identical time steps to create the graph states. The bottom panel shows the optimal circuits with minimal depths required to generate these graph states. In the circuit diagrams, vertical solid lines represent the application of CZ gates, while colored circles indicate the qubits involved in each gate's application. Vertical dashed lines separate different time steps.
  • Figure 3: The learned Q-matrix and its constituent reward components. The figure presents 3D bar charts of the matrices used in the Q-learning algorithm for a $4$-qubit graph state. From left to right, the first chart depicts the static reward matrix $R_{\text{sta}}(s,a)$ with two reward strata. The second shows the learned Q-values $Q(s,a)$ after training. The third illustrates the dynamic penalty matrix $R_{\text{dyn}}(s,a)$, computed online. Color gradients indicate the magnitude and sign of values: warm colors for positive and cool colors for negative. For the purpose of clear visualization, only the rows and columns containing nonzero entries of the matrices are shown. Collectively, these plots illustrate how static guidance and dynamic penalties interact to construct and learn a policy. For more details, see Subsections \ref{subsec:r_static_dynamic} and \ref{subsec:rl_visual}.
  • Figure 4: Exploration Steps (blue) and Space Size (orange) as a function of the number of qubits $n$. Both series are shown on the same axes with a logarithmic $y$-scale to make the different scaling regimes directly comparable. The fitted multiplicative growth model $\log_{10} y(n) \approx A + B n$ indicates an average per-qubit growth factor of $\sim 7.2\times$ for the Exploration Steps, but $\sim 3.4\times10^{18}\times$ for the Space Size. While the Exploration Steps increase by about $3.5$ orders of magnitude between $n=3$ and $n=8$, the Space Size explodes by more than $73$ orders of magnitude over the same interval, highlighting the combinatorial blow-up of the search space.
  • Figure 5: Scaling of computational and memory cost with the number of qubits. (Top) Total runtime required to obtain the optimal circuit as a function of the number of qubits, computed from the number of exploration steps assuming $4$ ms per step. (Bottom) Storage cost of the learned data structures, given as the sizes (in MB) of the $Q$ and $R_{\text{dyn}}$ databases produced during training. This joint view shows that, although the runtime grows with the number of qubits, the corresponding storage overhead remains moderate and well below the combinatorial growth of the underlying state space.
  • ...and 2 more figures
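
The edge-coloring picture in the Figure 2 caption, where CZ gates on same-colored edges share a time step, can be sketched with a simple greedy layering routine. This is only an illustration: the helper name `cz_layers` and the 4-cycle example graph are assumptions (the caption does not specify which 4-vertex graph is used), and a greedy pass is a heuristic that does not in general guarantee the minimal depth reported in the paper.

```python
def cz_layers(edges):
    """Greedily group CZ gates (graph edges) into layers of disjoint qubits.

    Each layer corresponds to one color / one time step: no two edges in a
    layer share a vertex, so their CZ gates can be applied simultaneously.
    """
    layers = []
    for u, v in edges:
        for layer in layers:
            # The edge fits if neither endpoint is already used in this layer.
            if all(u not in e and v not in e for e in layer):
                layer.append((u, v))
                break
        else:
            # No existing layer is free on both qubits: open a new time step.
            layers.append([(u, v)])
    return layers


# Hypothetical example: a 4-cycle on qubits 0..3.
square = [(0, 1), (1, 2), (2, 3), (0, 3)]
layers = cz_layers(square)
for step, layer in enumerate(layers):
    print(f"time step {step}: CZ on {layer}")
```

The number of layers is the CZ depth of the resulting circuit. It can never be smaller than the maximum vertex degree of the graph, since every edge at a given vertex needs its own time step; for the 4-cycle this bound is 2.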