Efficient and Scalable Architecture for Multiple-chip Implementation of Simulated Bifurcation Machines

Tomoya Kashimata; Masaya Yamasaki; Ryo Hidaka; Kosuke Tatsumura

TL;DR

This paper addresses the scaling of Ising-machine computations by introducing a streaming, multi-chip architecture for simulated bifurcation (SB) that maintains full spin-to-spin connectivity. By harmonizing in-chip data flow with inter-chip communication in a dual-ring topology, the design overlaps computation with communication and hides the communication time, so that strong scaling stays close to linear up to the vicinity of the ideal limit set by chip-to-chip communication latency. The authors validate a cycle-level performance model against FPGA experiments, reaching a pipeline efficiency of 97.9% for a 32,768-spin problem on an eight-FPGA cluster. Using that model, they project that a 79-FPGA cluster solving a 100,000-spin problem would perform comparably to a state-of-the-art 100,000-spin optical Ising machine. The work indicates that, with suitable data partitioning, streaming processing, and latency-aware scheduling, SB-based Ising machines can outperform the previous multi-chip architecture and approach the performance of optical Ising machines, offering a practical route to large-scale combinatorial optimization on digital hardware.

Abstract

Ising machines are specialized computers for finding the lowest energy states of Ising spin models, onto which many practical combinatorial optimization problems can be mapped. Simulated bifurcation (SB) is a quantum-inspired parallelizable algorithm for Ising problems that enables scalable multi-chip implementations of Ising machines. However, the computational performance of a previously proposed multi-chip architecture tends to saturate as the number of chips increases for a given problem size because both computation and communication are exclusive in the time domain. In this paper, we propose a streaming architecture for multi-chip implementations of SB-based Ising machines with full spin-to-spin connectivity. The data flow in in-chip computation is harmonized with the data flow in inter-chip communication, enabling the computation and communication to overlap and the communication time to be hidden. Systematic experiments demonstrate linear strong scaling of performance up to the vicinity of the ideal communication limit determined only by the latency of chip-to-chip communication. Our eight-FPGA (field-programmable gate array) cluster can compute a 32,768-spin problem with a high pipeline efficiency of 97.9%. The performance of a 79-FPGA cluster for a 100,000-spin problem, projected using a theoretical performance model validated on smaller experimental clusters, is comparable to that of a state-of-the-art 100,000-spin optical Ising machine.
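The overlap of computation and communication described in the abstract can be illustrated with a minimal software sketch. It is not the authors' hardware design: it assumes a single unidirectional ring, an even row-wise partition of the coupling matrix, and a hypothetical function name `ring_matvec`. Each simulated chip multiplies its block of the coupling matrix by whichever subvector it currently holds, then forwards that subvector to its neighbor. In hardware the two phases would run concurrently; here they are written sequentially for clarity.

```python
import numpy as np

def ring_matvec(J, x, num_chips):
    """Simulate y = J @ x on a ring of `num_chips` chips (illustrative only)."""
    J = np.asarray(J, dtype=float)
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    if n % num_chips != 0:
        raise ValueError("problem size must be divisible by the number of chips")
    sub = n // num_chips  # oscillators handled per chip (assumed even split)

    # Chip p owns rows [p*sub, (p+1)*sub) of J and starts with its own subvector.
    held = [x[p * sub:(p + 1) * sub].copy() for p in range(num_chips)]
    owner = list(range(num_chips))  # index of the subvector each chip holds
    acc = [np.zeros(sub) for _ in range(num_chips)]

    for _ in range(num_chips):
        # Compute phase: multiply-accumulate with the subvector currently held.
        for p in range(num_chips):
            q = owner[p]
            block = J[p * sub:(p + 1) * sub, q * sub:(q + 1) * sub]
            acc[p] += block @ held[p]
        # Communication phase: every chip passes its subvector to the next chip.
        held = [held[(p - 1) % num_chips] for p in range(num_chips)]
        owner = [owner[(p - 1) % num_chips] for p in range(num_chips)]

    return np.concatenate(acc)
```

After `num_chips` steps every chip has seen every subvector exactly once, so the concatenated accumulators equal the full matrix-vector product. A quick check is to compare `ring_matvec(J, x, 4)` with `J @ x` for a random 8×8 matrix.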

Paper Structure (22 sections, 17 equations, 12 figures, 5 tables, 2 algorithms)

Introduction
Cluster Design
Chip Design
Scalability
Cycle-level performance model
Weak scaling and strong scaling
Performance
Comparison with a State-of-the-Art Ising Machine
Related Works
Conclusion
Ising problem
Simulated bifurcation
Implementation
Implementation on Arria10 FPGAs
Implementation on Stratix10 FPGAs
...and 7 more sections

Figures (12)

Figure 1: Dependence of computational performance on cluster size (number of chips) at a fixed problem size for two cluster architectures: overlapping execution of computation and communication (this work) and exclusive execution (a previous work).
Figure 2: Cluster design. (a) Network topology of the cluster. (b) Partition and order of the computation. (c) Timing chart of each chip (Chip 1 as an example) [the latency of computation and communication logic is ignored for explanatory simplicity].
Figure 3: Chip design. (a) Block diagram for each chip. (b) Implementation details of the MAC component. $\Delta p_\mathrm{REG}$ serves as both the accumulators of the MAC units and the shift register for output.
Figure 4: Timing chart of a chip for each mode. Each of the hexagons corresponds to the duration of inputting a subvector. (a) Mode A: computational throughput limiting mode. (b) Mode B: intermediate mode. (c) Mode C: communication latency limiting mode. The red and black filled hexagons represent the duration of inputting a subvector computed by the chip itself. The non-filled hexagons represent the duration of inputting a subvector received from the other chips.
Figure 5: Measured and theoretical-model performance of the proposed clusters for SBMs. The red points represent the measured performance. The solid lines show the model performance. The dotted lines correspond to the ideal limits determined by computational throughput and communication latency. (a) Weak scaling characteristics: the number of chips ($P_\mathrm{chip}$) and the problem size ($N$) are increased in the same proportion [the number of oscillators per chip, $N/P_\mathrm{chip}$, is fixed]. (b) Strong scaling characteristics: the number of chips ($P_\mathrm{chip}$) is increased at a fixed problem size [the number of oscillators, $N$, is fixed]. (c) Comparison of measured performance and ideal limits at $N = 16{,}384$.
...and 7 more figures
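The weak and strong scaling regimes named in the Figure 5 caption can be made concrete with a short sketch. It assumes a fully connected $N \times N$ coupling matrix split evenly by rows across the chips, so each chip handles $N \cdot N/P_\mathrm{chip}$ coupling elements per matrix-vector product. The function names are hypothetical and the sketch says nothing about the paper's cycle-level model.

```python
def per_chip_couplings(n_spins, n_chips):
    """Coupling-matrix elements one chip processes per matrix-vector product."""
    if n_spins % n_chips != 0:
        raise ValueError("n_spins must be divisible by n_chips")
    # Each chip holds N / P rows of the full N x N coupling matrix.
    return n_spins * (n_spins // n_chips)

def weak_scaling(spins_per_chip, chip_counts):
    """N grows with the chip count so that N / P stays fixed."""
    return [(p, spins_per_chip * p, per_chip_couplings(spins_per_chip * p, p))
            for p in chip_counts]

def strong_scaling(n_spins, chip_counts):
    """N is fixed while the chip count grows."""
    return [(p, n_spins, per_chip_couplings(n_spins, p)) for p in chip_counts]

if __name__ == "__main__":
    # Columns: number of chips, problem size N, coupling elements per chip.
    for row in weak_scaling(4096, [1, 2, 4, 8]):
        print("weak  ", row)
    for row in strong_scaling(32768, [1, 2, 4, 8]):
        print("strong", row)
```

Under these assumptions the per-chip workload grows linearly with the number of chips in the weak-scaling case and shrinks as $1/P_\mathrm{chip}$ in the strong-scaling case, which is why strong scaling eventually becomes limited by communication latency rather than computational throughput.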
