Reinforcement Learning Assisted Recursive QAOA

Yash J. Patel; Sofiene Jerbi; Thomas Bäck; Vedran Dunjko

TL;DR

This work identifies limitations of recursive QAOA (RQAOA) for NP-hard Ising instances and introduces RL-RQAOA, a quantum–classical hybrid that uses reinforcement learning to learn edge-elimination decisions (and optionally QAOA angles) within the RQAOA framework. By framing the variable-elimination step as a multi-step RL problem with a softmax-based policy over two-qubit correlations, and by also proposing a classical analogue, RL-RONE, to isolate quantum contributions, the authors demonstrate that RL-RQAOA outperforms RQAOA on hard instances and retains competitive performance on typical cases. Numerical experiments on ensembles of random $d$-regular graphs show that RL-RQAOA provides a significant advantage on hard instances and a measurable quantum contribution when compared to RL-RONE, indicating a beneficial synergy between reinforcement learning and quantum-inspired optimization. The study suggests that learning-based enhancements to non-local quantum heuristics can yield practical improvements on near-term devices and motivates exploration at higher depths and with real quantum hardware for potential quantum advantages. The Ising cost Hamiltonian $H_n = \sum_{u} h_u Z_u + \sum_{(u,v)} J_{uv} Z_u Z_v$ and the RL policy $\pi_{\theta}(a|s)=\frac{\exp(\beta_{u,v}|M_{u,v}|)}{\sum_{(u,v)\in E} \exp(\beta_{u,v}|M_{u,v}|)}$ are central to the method, linking quantum correlations to learnable decision rules.
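The softmax policy quoted above can be illustrated with a minimal Python sketch. It only shows how edge probabilities follow from the magnitudes $|M_{u,v}|$ and the inverse temperatures $\beta_{u,v}$. The function name `softmax_edge_policy`, the dictionary layout, and the correlation values are assumptions made for illustration and are not taken from the paper. The value 25 mirrors the initialization of $\vec{\beta}$ reported in the figure captions. How the sign of the chosen edge is decided is not shown here.

```python
import numpy as np

def softmax_edge_policy(correlations, betas):
    """Return pi(a|s) over edges as a softmax of beta_{u,v} * |M_{u,v}|."""
    edges = list(correlations)
    logits = np.array([betas[e] * abs(correlations[e]) for e in edges])
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    probs = weights / weights.sum()
    return dict(zip(edges, probs))

# Hypothetical two-qubit correlations M_{u,v} for a 3-edge graph.
M = {(0, 1): 0.40, (0, 2): -0.55, (1, 2): 0.10}
beta = {e: 25.0 for e in M}  # uniform initialization of the inverse temperatures
policy = softmax_edge_policy(M, beta)
best_edge = max(policy, key=policy.get)
```

With a large uniform $\beta$, the policy concentrates on the edge of largest $|M_{u,v}|$, so it starts close to the greedy choice and can move away from it as the $\beta_{u,v}$ are trained.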

Abstract

Variational quantum algorithms such as the Quantum Approximate Optimization Algorithm (QAOA) have gained popularity in recent years, as they offer the hope of using NISQ devices to tackle hard combinatorial optimization problems. It is known, however, that at low depth, certain locality constraints of QAOA limit its performance. To go beyond these limitations, a non-local variant of QAOA, namely recursive QAOA (RQAOA), was proposed to improve the quality of approximate solutions. RQAOA has been studied less than QAOA and is less well understood; for instance, it is unclear for which families of instances it may fail to provide high-quality solutions. However, as we are tackling $\mathsf{NP}$-hard problems (specifically, the Ising spin model), it is expected that RQAOA does fail, raising the question of designing even better quantum algorithms for combinatorial optimization. In this spirit, we identify and analyze cases where RQAOA fails and, based on this, propose a reinforcement learning enhanced RQAOA variant (RL-RQAOA) that improves upon RQAOA. We show that the performance of RL-RQAOA improves over RQAOA: RL-RQAOA is strictly better on the identified instances where RQAOA underperforms, and performs similarly on instances where RQAOA is near-optimal. Our work exemplifies the potentially beneficial synergy between reinforcement learning and quantum (inspired) optimization in the design of new, even better heuristics for hard problems.

Paper Structure (20 sections, 2 theorems, 21 equations, 6 figures)

Introduction
Background
Quantum Approximate Optimization Algorithm
Classical Simulatability of QAOA for the Ising problem
Recursive QAOA
Reinforcement Learning Primer
Related Work
Limitations of RQAOA
Reinforcement Learning Enhanced RQAOA & Classical Brute Force Policy
Numerical Advantage of RL-RQAOA over RQAOA
Hard Instances for RQAOA
Benchmarking
RQAOA vs RL-RQAOA on Cage Graphs
RQAOA vs RL-RQAOA on hard instances
RL-RQAOA vs RL-RONE
...and 5 more sections

Key Result

Theorem 1

(bravyi2020obstacles, ozaeta2021expectation) Given an Ising cost Hamiltonian $H_n = \sum_{u \in V} h_u Z_u + \sum_{(u,v) \in E} J_{uv} Z_u Z_v$, define $s(x) := \sin(x)$ and $c(x) := \cos(x)$. Then for a fixed pair of qubits $1 \leq u \leq v \leq n$, [...] where [...] and [...]. Here, w.l.o.g., we assume that the underlying graph is a complete graph $K_n$ and $\gamma = 1$ since it can be absorbed into the definition
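To make the cost Hamiltonian in the theorem statement concrete, the following minimal Python sketch evaluates its diagonal value on a classical spin assignment and finds the maximum by exhaustive search. It illustrates the objective only, not the theorem's expectation-value formulas. The helper names `ising_energy` and `brute_force_max` and the example weights are assumptions for illustration. Maximization is used because the Figure 2 caption speaks of maximizing the energy.

```python
from itertools import product

def ising_energy(spins, h, J):
    """Value of sum_u h_u z_u + sum_{(u,v)} J_uv z_u z_v for z in {-1,+1}^n."""
    field = sum(h_u * spins[u] for u, h_u in h.items())
    coupling = sum(J_uv * spins[u] * spins[v] for (u, v), J_uv in J.items())
    return field + coupling

def brute_force_max(n, h, J):
    """Exhaustively search all 2^n assignments (only feasible for small n)."""
    best = max(product((-1, 1), repeat=n), key=lambda z: ising_energy(z, h, J))
    return best, ising_energy(best, h, J)

# Hypothetical frustrated triangle: no assignment satisfies all three couplings.
h = {}
J = {(0, 1): 1, (1, 2): 1, (0, 2): -1}
assignment, energy = brute_force_max(3, h, J)
```

For this triangle at most two of the three couplings can be satisfied at once, so the maximum energy is 1.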

Figures (6)

Figure 1: Training QAOA-based policies for reinforcement learning. We consider an RL-enhanced recursive QAOA (RL-RQAOA) scenario where a hybrid quantum-classical agent learns by interacting with an environment which we represent as a search tree induced by the recursive framework of RQAOA. The agent samples the next action $a$ (corresponding to selecting an edge and its sign) from its policy $\pi_{\theta}(a|s)$ and receives feedback in the form of a reward $r$, where each state corresponds to a graph (the state space is characterized by a search tree of weighted graphs, where each node of the tree corresponds to a graph). The nodes at each level of the search tree correspond to the candidate states for an agent to perceive by taking an action. For our hybrid agents, the policy $\pi_{\theta}$ of RL-RQAOA (see Def. \ref{def:softmax_qaoa}) along with the gradient estimate $\nabla_{\theta} \log \pi_{\theta}$ is evaluated on a CPU as we are in the regime where depth $l=1$. However, the policy can also be evaluated on a quantum processing unit (QPU) for higher depths, when classical simulations can only be performed efficiently for graphs of small size. The training of the policy is performed by a classical algorithm such as $\mathsf{REINFORCE}$ (see Alg. \ref{alg:reinforce}), which uses sample interactions and policy gradients to update parameters.
Figure 2: Illustration of a counterexample where the heuristic of using the energy-optimal QAOA angles in RQAOA fails. Here, we show that for the weighted graph (9 vertices and 24 edges) depicted in (a), RQAOA makes a mistake even in its strongest regime, that is, at the very first iteration (i.e., $n_c = 8$). The two-correlation coefficients for each edge (at energy-optimal angles) are shown as a horizontal bar plot in (b), where the edge $(0,2)$ has the maximal correlation coefficient. For the graph in (a), RQAOA with energy-optimal angles assigns a wrong edge-correlation (sign) to this edge, which is highlighted by a bold star in (c) and (d). Both (c) and (d) characterize the sets of good and bad QAOA angles where RQAOA makes a correct and a wrong choice, respectively. This example is counter-intuitive: as the edge $(0,2)$ has the highest weight in the graph, intuitively, the variables should be correlated (same sign) so as to maximize the energy. However, this leads to a sub-optimal solution, which RQAOA reaches with energy-optimal angles. Yet, for different settings of QAOA angles that do not maximize the overall energy, this edge still has the largest magnitude of correlation, but in this case it is an anti-correlation, which leads to the true optimum (see sub-figure (c)).
Figure 3: Number of ties per iteration of RQAOA (average over 200 runs) for the $(3,8)$-cage graph ($30$ vertices, $45$ edges, edge weights in $\{-1, +1\}$). We chose $n_c=8$ in our simulations, where RQAOA achieved a mean approximation ratio of $0.955 \pm 0.036$ and the probability of reaching the ground state was $33.5\%$. The y-axis (Number of Ties) is log-scaled. The black crosses depict the mean values, with the error bars showing the 95% confidence interval of 200 independent runs. The figure illustrates that one would invariably encounter a constant fraction of ties between maximal two-correlations whichever path is chosen in the search tree, implying an exponential blow-up in the size of the search tree to be explored by RQAOA.
Figure 4: Comparison of success probability in attaining ground state solutions of RL-RQAOA and RQAOA on cage graphs. The x-axis depicts the properties of the cage graph(s); for instance, d3-g6 denotes that the instance is $3$-regular with girth (length of the shortest cycle) 6. The error bars appear only for a few instances (specifically for d3-g9, d3-g10 and d5-g5) because multiple graph instances exist with the same properties (degree and girth). RL-RQAOA was evaluated by averaging the learning performance over 15 independent runs. For RQAOA, the best energy is taken given a fixed budget of 1400 runs. The probability for RL-RQAOA-max is computed by taking the maximum energy attained by the agent over all 15 independent runs for a particular episode. On the other hand, the probability for RL-RQAOA-vote (statistically more significant) is computed by aggregating the maximum energy attained for a particular episode only if more than 50% of the runs agree. We chose $n_c=8$ for instances with nodes $\leq 50$ and $n_c=10$ otherwise. The parameters $\theta = (\alpha, \gamma, \vec{\beta})$ of the RL-RQAOA policy were initialized by setting $\vec{\beta} = \{25\}^{(n^2-n)/2}$ and the angles $\{\alpha, \gamma\}$ (at every iteration) to energy-optimal angles (i.e., by following one run of RQAOA). All agents were trained using $\mathsf{REINFORCE}$ (Alg. \ref{alg:reinforce}).
Figure 5: Numerical evidence of the advantage of RL-RQAOA over RQAOA in terms of approximation ratio on hard instances. The box plot is generated by taking the mean of the best approximation ratio over 15 independent runs of 1400 episodes for RL-RQAOA. RL-RQAOA clearly outperforms RQAOA in terms of approximation ratio for the instances considered (these are exactly the instances where RQAOA's approximation ratio is $\leq 0.95$). We chose $n_c=8$ in our simulations, and the parameters $\theta = (\alpha, \gamma, \vec{\beta})$ of the RL-RQAOA policy were initialized by setting $\vec{\beta} = \{25\}^{(n^2-n)/2}$, with the angles $\{\alpha, \gamma\}$ (at every iteration) initialized randomly. All agents were trained using $\mathsf{REINFORCE}$ (Alg. \ref{alg:reinforce}).
...and 1 more figure
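The action described in the Figure 1 caption (selecting an edge and its sign) moves the agent from one weighted graph to a smaller one. A minimal Python sketch of such a transition follows, assuming the standard RQAOA substitution $Z_u = \sigma Z_v$ with $\sigma \in \{-1, +1\}$. The function name `eliminate_variable`, the dictionary representation, and the example weights are hypothetical and not taken from the paper.

```python
def eliminate_variable(h, J, u, v, sign):
    """Impose z_u = sign * z_v and return the reduced (h, J, offset).

    h maps vertex -> field, J maps (a, b) -> coupling, sign is -1 or +1.
    offset collects the constant energy contributed by the edge (u, v).
    """
    new_h = {w: val for w, val in h.items() if w != u}
    if u in h:  # the field on u becomes a field on v
        new_h[v] = new_h.get(v, 0.0) + sign * h[u]
    new_J, offset = {}, 0.0
    for (a, b), weight in J.items():
        if u not in (a, b):  # edge untouched by the elimination
            key = (min(a, b), max(a, b))
            new_J[key] = new_J.get(key, 0.0) + weight
            continue
        other = b if a == u else a
        if other == v:  # z_u z_v = sign, a constant
            offset += sign * weight
        else:  # z_u z_other = sign * z_v z_other
            key = (min(v, other), max(v, other))
            new_J[key] = new_J.get(key, 0.0) + sign * weight
    return new_h, new_J, offset

# Hypothetical triangle; eliminate vertex 0 by anti-correlating it with vertex 2.
J = {(0, 1): 1, (1, 2): 1, (0, 2): -1}
new_h, new_J, offset = eliminate_variable({}, J, u=0, v=2, sign=-1)
```

Here the couplings on edge $(1,2)$ cancel after the substitution, leaving only the constant offset of 1. Each such step removes one variable, which matches the captions' description of recursing until $n_c$ variables remain.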

Theorems & Definitions (4)

Definition 1: Policy of RL-RQAOA
Definition 2: Policy of RL-RONE
Theorem 1
Theorem 2
