Pushing the Boundary of Quantum Advantage in Hard Combinatorial Optimization with Probabilistic Computers

Shuvro Chowdhury, Navid Anjum Aadit, Andrea Grimaldi, Eleonora Raimondo, Atharva Raut, P. Aaron Lott, Johan H. Mentink, Marek M. Rams, Federico Ricci-Tersenghi, Massimo Chiappini, Luke S. Theogarajan, Tathagata Srimani, Giovanni Finocchio, Masoud Mohseni, Kerem Y. Camsari

TL;DR

This work investigates probabilistic computers (p‑computers) as a scalable classical pathway for hard combinatorial optimization, comparing discrete‑time simulated quantum annealing (DT‑SQA) and adaptive parallel tempering (APT) with isoenergetic cluster moves to a leading quantum annealer on 3D spin glasses. It demonstrates that increasing replica counts improves DT‑SQA scaling and that APT+ICM achieves superior time‑to‑solution via a universal finite‑size scaling, outperforming DT‑SQA in long runs. The authors show architecture‑level benefits, including FPGA acceleration and a plausible ASIC design capable of hosting millions of p‑bits, with energy efficiency vastly superior to GPUs/TPUs. Altogether, the results establish a rigorous classical baseline and argue that co‑designed p‑computers can approach or rival quantum solvers for real‑world optimization tasks while offering scalable, energy‑efficient hardware pathways. This sets a concrete foundation for evaluating practical quantum advantage in optimization and points toward a broad, hardware‑accelerated future for sampling and inference in discrete spaces.

Abstract

Recent demonstrations on specialized benchmarks have reignited excitement for quantum computers, yet whether they can deliver an advantage for practical real-world problems remains an open question. Here, we show that probabilistic computers (p-computers), when co-designed with hardware to implement powerful Monte Carlo algorithms, provide a compelling and scalable classical pathway for solving hard optimization problems. We focus on two key algorithms applied to 3D spin glasses: discrete-time simulated quantum annealing (DT-SQA) and adaptive parallel tempering (APT). We benchmark these methods against the performance of a leading quantum annealer on the same problem instances. For DT-SQA, we find that increasing the number of replicas improves residual energy scaling, in line with expectations from extreme value theory. We then show that APT, when supported by non-local isoenergetic cluster moves, exhibits a more favorable scaling and ultimately outperforms DT-SQA. We demonstrate these algorithms are readily implementable in modern hardware, projecting that custom Field Programmable Gate Arrays (FPGA) or specialized chips can leverage massive parallelism to accelerate these algorithms by orders of magnitude while drastically improving energy efficiency. Our results establish a new, rigorous classical baseline, clarifying the landscape for assessing a practical quantum advantage and presenting p-computers as a scalable platform for real-world optimization challenges.

Paper Structure

This paper contains 32 sections, 26 equations, 24 figures, 1 table, 2 algorithms.

Figures (24)

  • Figure 1: Schematic overview of the probabilistic algorithms and technological platforms: (a) Representation of a 3D Ising spin glass, where the weights connecting spins can be $+1$ (ferromagnetic, red) or $-1$ (antiferromagnetic, blue). Many real-world hard optimization problems can be mapped onto such spin-glass systems. (b, c) Two replica-based algorithms investigated in this work: discrete-time simulated quantum annealing (DT-SQA), in (b), and adaptive parallel tempering (APT) with isoenergetic cluster moves (ICM), in (c). In SQA, replicas are interconnected, and the strength of these interconnections is annealed according to a schedule. APT uses multiple replicas of the problem, each running at a different temperature, and periodically exchanges states between replicas to escape local minima. (d) The p-computing scheme, where a general-purpose CPU interfaces with a specialized probabilistic computer to efficiently implement Monte Carlo algorithms. Various implementations of probabilistic computers are shown, including CPU [chowdhury2023accelerated, aadit2022massively], GPU [Onizawa2025, block2010multi, preis2009gpu, yang2019high, fang2014parallel], Field Programmable Gate Arrays (FPGA) [aadit2022massively, niazi2024training, nikhar2024all, aadit2023accelerating, aadit2021computing, aadit2022physics], interconnected FPGAs, and monolithic CMOS + sMTJ (stochastic magnetic tunnel junction) chips [singh2024cmos, grimaldi2022experimental, singh2023hardware, borders2019integer]. Each platform offers trade-offs in sampling throughput, energy efficiency, problem size, and technological complexity.
  • Figure 2: Scaling improvement of the discrete-time simulated quantum annealing (DT-SQA) algorithm: (a) The final residual energy, $\rho_\mathrm{E}^\mathrm{f}$, as a function of annealing time, $t_\mathrm{a}$ (in Monte Carlo sweeps, MCS), for DT-SQA simulations with varying Trotter replicas, $R$, is shown for 3D Ising spin glass problems ($15 \times 15 \times 12$ with 2687 spins) using CPUs. The slope increases in absolute value with $R$. Each data point is averaged over 300 instances and 50 independent runs. The best energy among all Trotter replicas is selected at the end of $t_\mathrm{a}$. Error bars denote the 95% bootstrap confidence interval of the mean across spin-glass instances. An MCS is an algorithmic unit representing one update attempt per spin; hardware differences alter the wall-clock time per sweep but not the scaling exponent $\kappa_{\mathrm{f}}$. The comparison with the quantum annealer's physical time in (b) is therefore a comparison of this dimensionless exponent, not a direct conversion of time units. A full analysis of wall-clock time, which incorporates these hardware-specific prefactors, is presented in Section \ref{sec:arch_based} and Fig. \ref{fig:prefactor}. (b) Comparison of the DT-SQA scaling exponent with that of the quantum annealer (QA) from Ref. [King2023quantum]. With a sufficient number of replicas ($R = 2850$), DT-SQA achieves a more favorable scaling exponent. See Supplementary Fig. \ref{fig:SQA_embedded2} for similar results on embedded instances. (c) The slope improvement is plotted against $R$, showing alignment with extreme value theory predictions (red dashed line; see Supplementary Section \ref{sec:supp_evt}). The DT-SQA scaling exponent becomes comparable to the QA's at $R \approx 2800$ and exceeds it for larger values. Error bars denote the 95% confidence interval of the fit.
  • Figure 3: Scaling improvement of the APT algorithm with ICM: (a) Iso-Trotter replica comparison of DT-SQA and adaptive parallel tempering (APT) with isoenergetic cluster moves (ICM) on CPUs for the same logical instances as in Fig. \ref{fig:scaling_SQA}, using 50 independent runs per instance. A sweep-to-swap ratio of 1 minimizes residual energy for APT. While DT-SQA approaches a flat plateau, APT shows a sharp bending towards lower residual energy. (b) Finite-size scaling analysis for various cube sizes ($L$) reveals a universal collapse of the data. The later bending appears to be universal for APT, as shown by averaging over 200 independent runs per instance. For each size, 4 ICM replicas are used, with the total number of replicas varying due to the adaptive nature of the pre-processing algorithm. A similar collapse for APT without ICM is provided in Supplementary Fig. \ref{fig:rho_ta_L_woICM}. Error bars represent the 95% bootstrap confidence interval of the mean across spin-glass instances.
  • Figure 4: Architectural improvement with probabilistic computers in hardware: (a) Algorithmic complexity of the 3D spin glass problem as a function of the number of p-bits, showing the minimum number of Monte Carlo sweeps (MCS) required to reach a target threshold residual energy, $\rho_{\mathrm{E0}}^\mathrm{f} = 0.007$, using APT + ICM, independent of hardware platform. For each lattice size ($L$), 300 logical instances with 50 independent runs are reported. Swaps are probabilistically performed pairwise between adjacent $\beta$ replicas, alternating between even and odd pairs. Error bars (95% confidence) are small and often invisible. (b) Measured average MCS time per replica for CPU and FPGA. CPU shows $\mathcal{O}(n)$ scaling, while FPGA achieves $\mathcal{O}(1)$ scaling due to the massively parallel p-computer architecture. sMTJ projections assume experimentally demonstrated nanosecond p-bits [hayakawa2021nanosecond, soumah2024nanosecond]. (c) Total estimated time to reach $\rho_{\mathrm{E0}}^\mathrm{f} = 0.007$, combining (a) and (b). The times for ICM and swaps are negligible compared to sampling times (see Methods section) and are excluded from this estimate. It is also assumed that the FPGA can run all the replicas in parallel, as long as they fit on a single chip [nikhar2024all] (see Supplementary Section \ref{sec:supp_chip}). We assume the same for the CPU and do not include the replica factor when estimating these times. FPGA maintains an $\mathcal{O}(n)$ improvement over CPU, while 'prefactor improvement' for sMTJ-based p-computers refers to the additional constant speed-up expected when the same architecture is implemented with fast, on-chip sMTJ p-bits (approximately 1 ns intrinsic flip time), relative to the measured FPGA sweep time. sMTJ projections assume a single chip with 1 million sMTJs, achieving 1 million flips per ns (the thick orange line assumes 10 and 50 replicas for the upper and lower bounds, respectively). This also includes improvements stemming from additional parallelization: if the problem size is smaller than the chip's capacity, multiple independent runs or problem instances can be processed simultaneously on the same chip, scaling as $10^6 / (\text{spins per replica} \times \text{APT replicas})$.
  • Figure S1: Slope of the residual energy, $\rho_\mathrm{E}^\mathrm{f}$, as a function of total Monte Carlo effort: The residual energy is re-plotted against the total Monte Carlo effort for the logical instances of size $15\times15\times12$. Here, total Monte Carlo effort is defined as the total number of attempted spin flips during the whole annealing process. Re-plotting in this way changes only the prefactor, not the slope: for a power law, multiplying time by a constant factor shifts the curve on a log-log plot without changing its slope. We also emphasize that the total Monte Carlo effort is a fixed-size, CPU-centric measure, as discussed in Ref. [ronnow_defining_2014], and is not appropriate in our context, as we discuss in the text.
  • ...and 19 more figures
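
Two quantitative statements in the captions above can be checked with a few lines of Python. The sketch below fits the log-log slope of a synthetic power law, confirms that rescaling the time axis from sweeps to total attempted spin flips leaves the slope unchanged (the Figure S1 argument), and evaluates the chip-level parallelism ratio $10^6 / (\text{spins per replica} \times \text{APT replicas})$ from Figure 4. The exponent `kappa`, the prefactor, the replica count in the effort rescaling, the function names, and the use of floor division are illustrative assumptions, not values or code from the paper.

```python
import numpy as np

def fit_power_law_slope(t, rho):
    """Least-squares slope of log(rho) versus log(t)."""
    slope, _intercept = np.polyfit(np.log(t), np.log(rho), 1)
    return slope

def concurrent_runs(spins_per_replica, apt_replicas, chip_pbits=10**6):
    """Whole independent runs that fit on one chip (floor of the caption's ratio)."""
    return chip_pbits // (spins_per_replica * apt_replicas)

# Synthetic power-law data: rho = A * t^(-kappa). Both A and kappa are made up.
kappa, A = 0.4, 0.5
t_sweeps = np.logspace(1, 5, 9)           # annealing time in Monte Carlo sweeps
rho = A * t_sweeps ** (-kappa)

# Total Monte Carlo effort = sweeps x spins x replicas (a constant rescaling).
n_spins, n_replicas = 2687, 100           # n_replicas is an illustrative choice
t_effort = t_sweeps * n_spins * n_replicas

slope_sweeps = fit_power_law_slope(t_sweeps, rho)
slope_effort = fit_power_law_slope(t_effort, rho)

if __name__ == "__main__":
    print(f"slope vs sweeps: {slope_sweeps:.3f}")
    print(f"slope vs effort: {slope_effort:.3f}")
    print("runs per chip, 10 replicas:", concurrent_runs(n_spins, 10))   # 37
    print("runs per chip, 50 replicas:", concurrent_runs(n_spins, 50))   # 7
```

Because $\log(ct) = \log c + \log t$, the rescaling only shifts the curve horizontally on a log-log plot, which is why both fits return the same slope.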