Optimizing ZX-Diagrams with Deep Reinforcement Learning

Maximilian Nägele; Florian Marquardt

TL;DR

The paper addresses the challenge of optimizing ZX-diagrams by using reinforcement learning to find sequences of local ZX-calculus transformations. It introduces a graph neural network policy, trained with PPO, that operates directly on ZX-diagrams. The actions act on individual nodes and edges, plus a global Stop action, and the reward is based on reducing the diagram size while preserving the represented quantum process up to a scalar. The RL agent outperforms a greedy strategy, simulated annealing, and hand-crafted ZX-diagram optimizers, and it generalizes to diagrams much larger than those seen in training. The work suggests applications to tasks such as gate-count reduction, speeding up tensor-network simulation, and circuit equivalence checking. It also outlines future directions: incorporating rules that preserve gflow or Pauli flow, and rewards that account for circuit extraction.
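
As a rough illustration of the size-based reward described above, the following sketch assumes that the per-step reward is simply the decrease in node count. This is an assumption made for illustration, not the paper's stated reward function, and the function names are hypothetical. It shows why an action that temporarily increases the node count can still be part of a trajectory with positive cumulative reward.

```python
def step_reward(nodes_before: int, nodes_after: int) -> int:
    """Assumed reward: number of nodes removed by one action (negative if nodes were added)."""
    return nodes_before - nodes_after


def cumulative_reward(node_counts: list[int]) -> int:
    """Sum of step rewards along a trajectory of node counts (undiscounted)."""
    return sum(
        step_reward(before, after)
        for before, after in zip(node_counts, node_counts[1:])
    )


# Hypothetical trajectory: the first action adds a node (non-greedy),
# the following actions remove more nodes than were added.
trajectory = [12, 13, 11, 9]
total = cumulative_reward(trajectory)
```

Under this assumption the sum telescopes, so the undiscounted cumulative reward equals the initial node count minus the final node count, regardless of intermediate increases.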

Abstract

ZX-diagrams are a powerful graphical language for the description of quantum processes with applications in fundamental quantum mechanics, quantum circuit optimization, tensor network simulation, and many more. The utility of ZX-diagrams relies on a set of local transformation rules that can be applied to them without changing the underlying quantum process they describe. These rules can be exploited to optimize the structure of ZX-diagrams for a range of applications. However, finding an optimal sequence of transformation rules is generally an open problem. In this work, we bring together ZX-diagrams with reinforcement learning, a machine learning technique designed to discover an optimal sequence of actions in a decision-making problem, and show that a trained reinforcement learning agent can significantly outperform other optimization techniques like a greedy strategy, simulated annealing, and state-of-the-art hand-crafted algorithms. The use of graph neural networks to encode the policy of the agent enables generalization to diagrams much bigger than seen during the training phase.
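
To make the idea of a local transformation rule concrete, here is a minimal sketch of the standard spider-fusion rule of the ZX-calculus: two spiders of the same color connected by a plain edge merge into one spider whose phase is the sum of the two phases modulo 2π. The data layout (a dict of spiders and a set of undirected edges), the function name `fuse`, and the restriction to diagrams without self-loops or shared neighbors are assumptions made for this toy example; it is not the representation used by the authors or by PyZX.

```python
from fractions import Fraction


def fuse(spiders, edges, u, v):
    """Toy spider fusion: merge spider v into spider u.

    spiders: dict node -> (color, phase), phase in units of pi
    edges:   set of frozenset({a, b}) plain edges (no self-loops assumed)
    """
    if frozenset((u, v)) not in edges:
        raise ValueError("spiders must be connected by an edge")
    if spiders[u][0] != spiders[v][0]:
        raise ValueError("spiders must have the same color")

    # Phases add modulo 2*pi, i.e. modulo 2 in units of pi.
    color = spiders[u][0]
    phase = (spiders[u][1] + spiders[v][1]) % 2

    new_spiders = {n: s for n, s in spiders.items() if n != v}
    new_spiders[u] = (color, phase)

    new_edges = set()
    for edge in edges:
        if edge == frozenset((u, v)):
            continue  # the connecting edge disappears
        a, b = tuple(edge)
        rewired = frozenset((u if a == v else a, u if b == v else b))
        if rewired in new_edges:
            # Parallel edges need further rules; out of scope for this sketch.
            raise ValueError("shared neighbors are not handled in this toy")
        new_edges.add(rewired)
    return new_spiders, new_edges


# Two Z-spiders with phases pi/4 and 7*pi/4, each with one X-spider neighbor.
spiders = {0: ("Z", Fraction(1, 4)), 1: ("Z", Fraction(7, 4)),
           2: ("X", Fraction(0)), 3: ("X", Fraction(0))}
edges = {frozenset((0, 1)), frozenset((0, 2)), frozenset((1, 3))}
fused_spiders, fused_edges = fuse(spiders, edges, 0, 1)
```

In this example the fused spider has phase 0 (since π/4 + 7π/4 = 2π) and inherits both X-spider neighbors, so the diagram shrinks by one node.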

Paper Structure (16 sections, 8 equations, 10 figures, 2 tables)

Introduction
ZX-diagrams
Optimization of ZX-diagrams as a reinforcement learning problem
Neural network architecture
Results
Training
Comparison with other techniques
Analysis of learned policy
Scaling
Outlook
Data availability
Details on ZX-calculus
Sampled diagrams
Details on custom PPO algorithm
Details on network architecture
...and 1 more sections

Figures (10)

Figure 1: Schematic of the optimization loop. At each step, the reinforcement learning agent is provided with a ZX-diagram in the form of a graph. The agent then uses a graph neural network to suggest action probabilities of local graph transformations (color-coded), which act on either a unique edge (orange) or node (blue). Finally, an action is sampled from this probability distribution and applied to the diagram. In total, there are 6 separate actions per node and edge, some of which are not allowed in their local environment and, therefore, masked (grey dots). For a definition of the graph transformations see Figure 2.
Figure 2: Encoding of the local transformation rules of ZX-diagrams as actions of a reinforcement learning agent. Blue colors indicate the encoding as an action of the agent acting on either an edge or a node. Some transformations are implemented in both directions as separate actions of the reinforcement learning agent (equal signs), while some are only implemented in one direction (arrows). Three dots stand for zero or more edges. Each rule also holds with the spiders' colors inverted and in both directions. Black squares represent a Hadamard gate as defined by the Hadamard fuse transformation. During the Unfuse transformation, a spider is split into two by arbitrarily splitting up its angle between the two resulting spiders, connecting them with a new edge, and transferring a subset of the originally connected edges (orange) to the new spider. In the Copy transformation, $a\in \{0,1\}$. In the Euler transformation, $\alpha_1/\beta_1/\gamma_1$ are related to $\alpha_2/\beta_2/\gamma_2$ by trigonometric functions as defined in Vilmart (2019).
Figure 3: Results. (a) Training progress as the agent is trained to reduce the node number in random ZX-diagrams. Mean cumulative reward of the agent per trajectory against total steps taken in the environment. (b) Optimization of an example ZX-diagram ten times larger than the RL agent's training diagrams. Number of nodes in the ZX-diagram against each action taken for the RL agent (orange), a greedy strategy (blue), and simulated annealing (green). For the RL agent and simulated annealing, multiple trajectories are plotted (transparent). The RL agent and simulated annealing significantly outperform the greedy strategy in terms of cumulative reward, with the RL agent requiring an order of magnitude fewer steps than simulated annealing (inset). Actions taken by the RL agent that intermittently increase the node number (i.e. non-greedy actions) are indicated by arrows. (c) Average number of nodes after optimization of $1000$ ZX-diagrams with $10$-$15$ initial spiders (left), which is the size the agent was trained on, and $100$-$150$ initial spiders (right). We compare the RL agent (orange) to simulated annealing (green), a custom greedy strategy (blue), the full_reduce function of the PyZX software package (red), a combination of full_reduce and the greedy strategy (turquoise), and a combination of full_reduce and the RL agent (yellow). Hyperparameters for simulated annealing are optimized to give good performance on two example diagrams and then kept fixed for all diagrams. The RL agent outperforms the other strategies. (d) Two examples of non-greedy actions learned by the agent (orange lines) that lead to a positive cumulative reward through subsequent Fuse actions (blue lines). (e) Example ZX-diagram sampled from the agent's training set. The greedy strategy can reduce the node number by applying $3$ Fuse actions (blue lines), while the agent further optimizes the diagram beginning with a non-greedy Pi action (orange line).
Figure 4: Analysis of learned policy. (a) Action dependence on the local environment. 1000 actions of each type are sampled by the agent. Then, for each action and the diagram in which it was chosen, sub-diagrams are built up in layers around the node/edge identified with the action (see inset). For each sub-diagram spanning only the nodes in a specific layer, we compute the agent's unnormalized probability of sampling the chosen action $P_\mathrm{layer}$ and compute the difference $\epsilon$ to its probability $P_\mathrm{complete}$ in the full diagram, where we define $\epsilon$ in Eq. \ref{eq:error_dist}. We plot the average of this difference against the number of layers for $5$ action types. (b) Probability of sampling the Copy action on the blue edge in the diagram depicted in the inset for multiple outputs of the diagram $n_\mathrm{out}$ and multiple additionally inserted spiders on the outputs $n_\mathrm{extra}$. The ideal strategy is to select the Copy action for $n_\mathrm{out} - n_\mathrm{extra} \leq 2$. The agent approximately learns the ideal policy.
Figure 5: (a) Translation of common quantum gates, states, and (post-selected) measurements into corresponding ZX-diagrams. The translations are true only up to a scalar factor. Square boxes are Z-/X-rotation gates with an angle $\alpha$. (b) By inserting Hadamards (black boxes) on all edges connected to a spider, its color can be changed.
...and 5 more figures
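
The caption of Figure 1 describes sampling an action from a probability distribution in which disallowed actions are masked. The sketch below shows one common way to do this, a softmax restricted to the allowed entries. The function names, the flat list of per-action scores, and the use of a softmax are assumptions made for illustration; the paper's actual network outputs and masking scheme may differ.

```python
import math
import random


def masked_probabilities(logits, allowed):
    """Softmax over allowed actions only; masked actions get probability 0."""
    if not any(allowed):
        raise ValueError("at least one action must be allowed")
    # Subtract the largest allowed logit for numerical stability.
    top = max(l for l, ok in zip(logits, allowed) if ok)
    weights = [math.exp(l - top) if ok else 0.0
               for l, ok in zip(logits, allowed)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(logits, allowed, rng):
    """Draw one action index according to the masked distribution."""
    probs = masked_probabilities(logits, allowed)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical scores for three candidate actions; the third is masked,
# so it is never sampled even though it has the highest score.
logits = [0.0, 0.0, 5.0]
allowed = [True, True, False]
probs = masked_probabilities(logits, allowed)
action = sample_action(logits, allowed, random.Random(0))
```

Masking before normalization keeps the remaining probabilities summing to one, so the agent can never select a transformation that is invalid in the current local environment.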
