Micro Blossom: Accelerated Minimum-Weight Perfect Matching Decoding for Quantum Error Correction

Yue Wu, Namitha Liyanage, Lin Zhong

TL;DR

Micro Blossom is the first publicly known exact MWPM decoder with sub-microsecond latency for quantum error correction. It partitions the blossom algorithm between a software primal phase and a programmable accelerator with many fine-grained processing units. The architecture uses vertex- and edge-level parallelism, per-vertex local state, and round-wise fusion to reduce the worst-case latency from $O(d^{12})$ to $O(d^9)$ and the average latency from $O(p d^3+1)$ to $O(p^2 d^2+1)$ for code distance $d$ and physical error rate $p \ll 1$. An FPGA prototype at $d=13$ and $p=0.1\%$ achieves an average decoding latency of $0.8\ \mu s$. The work also introduces practical resource-efficient variants and isolated-conflict handling to reduce CPU–accelerator interactions, and its latency is 8 times shorter than the best previously reported for MWPM decoder implementations. It shows that real-time, exact MWPM decoding is feasible for fault-tolerant quantum computation and provides an open-source artifact to enable further hardware-accelerated QEC research.

Abstract

Minimum-Weight Perfect Matching (MWPM) decoding is important to quantum error correction because of its accuracy. However, many believe that it is difficult, if possible at all, to achieve the microsecond latency requirement posed by superconducting qubits. This work presents the first publicly known MWPM decoder, called Micro Blossom, that achieves sub-microsecond decoding latency. Micro Blossom employs a heterogeneous architecture that carefully partitions a state-of-the-art MWPM decoder between software and a programmable accelerator with parallel processing units, one for each vertex/edge of the decoding graph. On a surface code with code distance $d$ and a circuit-level noise model with physical error rate $p$, Micro Blossom's accelerator employs $O(d^3)$ parallel processing units to reduce the worst-case latency from $O(d^{12})$ to $O(d^9)$ and reduce the average latency from $O(p d^3+1)$ to $O(p^2 d^2+1)$ when $p \ll 1$. We report a prototype implementation of Micro Blossom on an FPGA. Measured at $d=13$ and $p=0.1\%$, the prototype achieves an average decoding latency of $0.8\ \mu s$ at a moderate clock frequency of $62\ \mathrm{MHz}$. Micro Blossom is the first publicly known hardware-accelerated exact MWPM decoder, and the decoding latency of $0.8\ \mu s$ is 8 times shorter than the best latency of MWPM decoder implementations reported in the literature.
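To put the reported numbers in perspective, the short sketch below converts the average latency into clock cycles at the stated frequency and derives the prior-art latency implied by the "8 times shorter" claim. The constants come from the abstract; the function name and the implied prior latency are illustrative inferences, not figures stated by the authors.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# Only the three constants below come from the text; everything derived
# from them is an inference for illustration.

CLOCK_HZ = 62e6          # reported clock frequency: 62 MHz
AVG_LATENCY_S = 0.8e-6   # reported average decoding latency: 0.8 microseconds
SPEEDUP = 8              # reported improvement over prior MWPM implementations


def latency_in_cycles(latency_s: float, clock_hz: float) -> float:
    """Number of clock cycles that fit in the given latency."""
    return latency_s * clock_hz


avg_cycles = latency_in_cycles(AVG_LATENCY_S, CLOCK_HZ)

# If 0.8 us is 8 times shorter than the best prior result, that prior
# result would be roughly 8 * 0.8 us (an inference, not a quoted number).
implied_prior_latency_s = AVG_LATENCY_S * SPEEDUP

if __name__ == "__main__":
    print(f"average latency in cycles: {avg_cycles:.1f}")
    print(f"implied prior latency: {implied_prior_latency_s * 1e6:.1f} us")
```

Under these assumptions the average decode takes on the order of fifty clock cycles, which suggests the sub-microsecond figure comes from needing few cycles per decode rather than from a high clock rate.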

Paper Structure

This paper contains 52 sections, 7 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Surface code and decoding graph. (a) The surface code interleaves data qubits ($\CIRCLE$) with stabilizer qubits ($\Circle$). Here we only show $\hat{Z}$-type stabilizer qubits that detect $\hat{X}$ errors. The $\hat{X}$-type stabilizers can be decoded likewise, independently. (b) The decoding graph of (a). Each vertex represents a stabilizer measurement; each edge represents a potential error. Stabilizers with flipped measurements and their vertices (defect vertices) are marked in red in both figures. (c) A decoding graph from a circuit-level implementation of the surface code with $d$ rounds of measurements.
  • Figure 2: Potential speedup according to Amdahl's Law, sampled from Fusion Blossom [wu2023qce] running on an Apple M1 Max. The potential speedup is the theoretical upper bound achievable by optimizing the dual phase.
  • Figure 3: A node is either matched or in an alternating tree. The primal phase maintains the tight edges, shown as both solid lines ($x_e = 1$) and dotted lines ($x_e = 0$). The radius of a blossom $S$ represents the corresponding dual variable $y_S$. The direction of each node, $\Delta y_S \in \{ 0, +1, -1 \}$, is marked with different colors.
  • Figure 4: Fault-tolerant logical $\hat{T}$ gate on the target qubit is implemented using a resource qubit in the magic $|T\rangle$ state [bravyi2005universal] and a circuit consisting of fault-tolerant Clifford gates and a conditional logical $\hat{S}$ gate with decoder feedforward.
  • Figure 5: Heterogeneous architecture of Micro Blossom. The blue blocks and green cylinders represent vPUs and ePUs, respectively. An instruction is first broadcast to all PUs; each PU then updates its local state and generates a response, and the responses are convergecast into a single response. Each PU only talks to its immediate neighbors on the decoding graph. The syndrome data from the quantum hardware is loaded directly into the vPUs in a streaming manner.
  • ...and 6 more figures
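The broadcast/convergecast pattern described in the Figure 5 caption can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the authors' hardware design: the `ProcessingUnit` class, the `slack` field, and the `"grow"`/`"query"` instruction names are hypothetical, and the choice of `min` as the combining operator is made only for the example. Neighbor-to-neighbor communication between PUs is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ProcessingUnit:
    """Hypothetical PU holding a single piece of local state."""
    slack: int  # illustrative local value, not a field from the paper

    def execute(self, instruction: str, operand: int = 0) -> Optional[int]:
        """Update local state or report it, depending on the instruction."""
        if instruction == "grow":       # hypothetical instruction name
            self.slack -= operand
            return None
        if instruction == "query":      # hypothetical instruction name
            return self.slack
        raise ValueError(f"unknown instruction: {instruction}")


def convergecast(responses: List[Optional[int]],
                 combine: Callable[[int, int], int]) -> Optional[int]:
    """Merge per-PU responses pairwise, level by level, like a reduction tree."""
    level = [r for r in responses if r is not None]
    if not level:
        return None
    while len(level) > 1:
        merged = [combine(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:         # odd element passes through unchanged
            merged.append(level[-1])
        level = merged
    return level[0]


def broadcast(pus: List[ProcessingUnit], instruction: str,
              operand: int = 0) -> Optional[int]:
    """Send one instruction to every PU and return the single merged response."""
    responses = [pu.execute(instruction, operand) for pu in pus]
    return convergecast(responses, min)


pus = [ProcessingUnit(slack=s) for s in (5, 3, 8, 4)]
first = broadcast(pus, "query")    # smallest slack across all PUs
broadcast(pus, "grow", 2)          # every PU updates its own state
second = broadcast(pus, "query")   # smallest slack after the update
```

In this sketch the controller sees one response per instruction regardless of how many PUs exist, and the pairwise merge takes a number of levels logarithmic in the PU count, which is the property that makes the pattern attractive for a design with one PU per vertex and edge.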