Scalable Neural Decoders for Practical Fault-Tolerant Quantum Computation

Andi Gu, J. Pablo Bonilla Ataides, Mikhail D. Lukin, Susanne F. Yelin

Abstract

Quantum error correction (QEC) is essential for scalable quantum computing. However, it requires classical decoders that are fast and accurate enough to keep pace with quantum hardware. While quantum low-density parity-check codes have recently emerged as a promising route to efficient fault tolerance, current decoding algorithms do not allow one to realize the full potential of these codes in practical settings. Here, we introduce a convolutional neural network decoder that exploits the geometric structure of QEC codes, and use it to probe a novel "waterfall" regime of error suppression, demonstrating that the logical error rates required for large-scale fault-tolerant algorithms are attainable with modest code sizes at current physical error rates, and with latencies within the real-time budgets of several leading hardware platforms. For example, for the $[[144, 12, 12]]$ Gross code, the decoder achieves logical error rates up to $\sim 17\times$ below those of existing decoders, reaching logical error rates of $\sim 10^{-10}$ at a physical error rate of $p=0.1\%$, with 3-5 orders of magnitude higher throughput. This decoder also produces well-calibrated confidence estimates that can significantly reduce the time overhead of repeat-until-success protocols. Taken together, these results suggest that the space-time costs associated with fault-tolerant quantum computation may be significantly lower than previously anticipated.


Figures (9)

  • Figure 1: Structure-aware neural decoding for quantum error correction. (a) Top: errors accumulate on data qubits; stabilizer measurements produce a spacetime syndrome (a pattern of detection events); a decoder determines whether a logical error occurred; the logical state is protected. Bottom (neural network decoder): syndromes are embedded into $H$-dimensional representations and processed by $L$ convolutional layers whose local structure respects the code geometry---3D convolutions for surface codes, generalized convolutions on the torus for bivariate bicycle (BB) codes---followed by a final convolution scattering to data qubits, pooling over the data qubits in each logical operator's support, and a prediction head applied independently to each logical observable (a minimal pipeline sketch follows this figure list). (b) Error suppression on the $\llbracket 144, 12, 12 \rrbracket$ BB code ($R = d$ rounds). The logical error rate per logical qubit per cycle, $P_L \approx \sum_w N(w)\,p^w$, where $N(w)$ is the number of minimal failure modes of weight $w$, decomposes into two regimes: a steep waterfall ($\sim p^{10.8}$) where the numerous high-weight failure modes dominate at moderate physical error rates $p$, transitioning to a distance-limited floor ($\sim p^{6.4}$) at very low noise (the crossover between the two regimes is worked out after this figure list). BP+OSD (orange, $\sim p^{5.4}$) misses the waterfall entirely. (c) Accuracy--latency tradeoff at $p=0.2\%$: Cascade (GPU inference on an NVIDIA H200) spans a range of (amortized) latencies while achieving lower logical error rates than prior decoders (single-threaded CPU). Diamond markers show the reported single-shot latencies of BP+OSD [roffe2020decoding], Relay [muller2024relax], Tesseract [shutty2025tesseract], and the ML decoder of [blue2025machinelearningdecodingcircuitlevel]. Error bars indicate 95% credible intervals.
  • Figure 1: Bottleneck residual block. The network backbone consists of $L$ identically structured blocks with independent learned parameters, each processing a hidden representation of dimension $H$ at every syndrome location. First, the representation is projected from $H$ to $H/4$ dimensions (reduce); the code-specific convolution then operates in this lower-dimensional space (message passing), after which the representation is projected back to $H$ dimensions (restore). The bottleneck reduces the cost of the convolution by roughly $16\times$ while preserving expressive capacity. Each projection and convolution is preceded by batch normalization (BN), which standardizes activations to zero mean and unit variance, and a SiLU nonlinearity. A scaled residual connection adds the block input $h^{(l)}$ directly to the output, weighted by $1/\sqrt{2L}$, so that each block learns a correction to the identity mapping rather than reconstructing the full representation---a standard technique for stable training of deep networks (see the code sketch after this figure list).
  • Figure 2: Distance scaling of BB code decoders under circuit-level depolarizing noise. (a--c) Logical error rate per logical qubit per round ($R = d$ rounds) versus physical error rate for the $\llbracket 72, 12, 6 \rrbracket$, $\llbracket 144, 12, 12 \rrbracket$, and $\llbracket 288, 12, 18 \rrbracket$ bivariate bicycle codes, respectively. Cascade achieves lower logical error rates than BP+OSD, Relay, and (where evaluable) Tesseract across all code sizes. (d) Accuracy vs. latency at $p = 0.2\%$ for all three BB codes (timing reported for BB codes is amortized latency). We include the published results of a different ML decoder [blue2025machinelearningdecodingcircuitlevel] for the $\llbracket 72, 12, 6 \rrbracket$ and $\llbracket 144, 12, 12 \rrbracket$ codes. For the $\llbracket 288, 12, 18 \rrbracket$ code, we are unable to evaluate Tesseract due to its computational cost; BP+OSD is similarly intractable at lower noise levels.
  • Figure 2: Architectural ablation on surface codes. (a) Architectural inductive biases visualized on 2D syndrome grids. Convolution: Arrows from two different positions show identical directional patterns (colors encode direction-specific learned weights), demonstrating translation equivariance combined with learned anisotropy (an equivariance check is sketched after this figure list). Local attention: Arrow thickness represents learned attention weights; different thickness patterns at different positions show the architecture is position-dependent rather than translation-equivariant. Full attention: Global all-to-all connectivity with position-dependent learned weights. (b) Logical error rate versus training compute (PFLOPs) for three architectures with fixed depth ($L=8$) and width ($H=256$) at distance $d=15$, evaluated at physical error rate $p=8\%$. Convolution achieves the lowest error rate, while local attention and full attention saturate at higher error rates despite equal or greater training compute. Existing decoders (MWPM, Correlated MWPM, Tesseract) are shown for reference.
  • Figure 3: Distance scaling of surface code decoders at $p = 0.2\%$ under circuit-level depolarizing noise. (a) Logical error rate per round versus code distance $d$ for MWPM, correlated MWPM, Tesseract, and Cascade. All decoders exhibit exponential error suppression ($P_L \propto \Lambda^{-\lfloor(d+1)/2\rfloor}$), with Cascade and Tesseract achieving the steepest slopes. (b) Accuracy--latency trade-off across code distances (GPU inference on NVIDIA H200). Amortized latency (dashed lines, batched inference) is significantly lower than the single-shot latency (solid lines, unbatched inference). (c) Error suppression factor $\Lambda$ extracted from exponential fits to panel (a) (a fitting sketch follows this figure list).
  • ...and 4 more figures
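As a worked illustration of the two-regime picture in Fig. 1(b): writing the floor contribution to $P_L \approx \sum_w N(w)\,p^w$ as $N_f\,p^{6.4}$ and the waterfall contribution as $N_w\,p^{10.8}$ (the exponents are the fitted slopes from the figure; the prefactors $N_f \ll N_w$, which count the low- and high-weight minimal failure modes respectively, are left symbolic), the two terms balance at a crossover error rate

$$N_f\,p_\times^{6.4} = N_w\,p_\times^{10.8} \quad\Longrightarrow\quad p_\times = \left(N_f / N_w\right)^{1/4.4},$$

so the steep waterfall governs error suppression above $p_\times$, while the distance-limited floor takes over below it.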
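The bottleneck residual block of Fig. 1 is concrete enough to sketch in code. The following is a minimal PyTorch rendering, not the authors' implementation: the $H \to H/4 \to H$ bottleneck, the BN + SiLU pre-activations, and the $1/\sqrt{2L}$ residual scaling follow the caption, while the $3\times3\times3$ kernel (the surface-code case; a BB code would use a generalized convolution on the torus), the module names, and the choice to scale the residual branch rather than the skip path are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Pre-activation bottleneck residual block (sketch of the figure above)."""

    def __init__(self, hidden: int, num_blocks: int):
        super().__init__()
        h4 = hidden // 4
        # Reduce: project H -> H/4 with a 1x1x1 convolution.
        self.reduce = nn.Sequential(
            nn.BatchNorm3d(hidden), nn.SiLU(), nn.Conv3d(hidden, h4, 1)
        )
        # Message passing: the code-specific convolution acts in the
        # low-dimensional space, cutting its cost by roughly 16x.
        self.message = nn.Sequential(
            nn.BatchNorm3d(h4), nn.SiLU(), nn.Conv3d(h4, h4, 3, padding=1)
        )
        # Restore: project H/4 -> H.
        self.restore = nn.Sequential(
            nn.BatchNorm3d(h4), nn.SiLU(), nn.Conv3d(h4, hidden, 1)
        )
        self.scale = (2 * num_blocks) ** -0.5  # residual scaling 1/sqrt(2L)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Each block contributes a scaled correction to the identity mapping.
        return h + self.scale * self.restore(self.message(self.reduce(h)))
```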
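Reusing BottleneckBlock, the pipeline of Fig. 1(a) can be sketched end to end. Everything below beyond the caption's description is a hypothetical simplification for the surface-code case: the single-channel input, the plain 3D syndrome grid, the identification of data-qubit sites with the output grid of the final convolution, and all names.

```python
class ConvDecoder(nn.Module):
    """Sketch: embed -> L residual blocks -> scatter to data qubits ->
    pool over each logical operator's support -> per-observable head."""

    def __init__(self, hidden: int = 256, depth: int = 8):
        super().__init__()
        self.embed = nn.Conv3d(1, hidden, 1)        # syndrome -> H-dim embedding
        self.blocks = nn.Sequential(
            *[BottleneckBlock(hidden, depth) for _ in range(depth)]
        )
        self.scatter = nn.Conv3d(hidden, hidden, 3, padding=1)  # to data qubits
        self.head = nn.Linear(hidden, 1)            # shared prediction head

    def forward(self, syndrome: torch.Tensor, supports: list) -> torch.Tensor:
        # syndrome: (batch, 1, rounds, rows, cols) detection events in {0, 1}
        # supports: one index tensor per logical observable, listing the
        #           flattened data-qubit sites in that operator's support
        h = self.scatter(self.blocks(self.embed(syndrome.float())))
        feats = h.flatten(start_dim=2)              # (batch, H, num_sites)
        pooled = [feats[:, :, s].mean(dim=-1) for s in supports]
        return torch.cat([self.head(v) for v in pooled], dim=-1)  # logits
```

The head outputs one logit per logical observable; a sigmoid on each logit is a natural source for the calibrated confidence estimates mentioned in the abstract, though how the paper derives them is not specified here.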
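The translation equivariance illustrated in Fig. 2(a) can be verified directly: convolving a shifted syndrome equals shifting the convolved one. A minimal check follows; the toy grid and kernel sizes are arbitrary, and circular padding makes the identity exact on a torus (the natural geometry for BB codes):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False,
                 padding_mode="circular")
x = torch.randn(1, 1, 16, 16)                  # toy 2D syndrome grid

shift = lambda t: torch.roll(t, shifts=2, dims=-1)
print(torch.allclose(conv(shift(x)), shift(conv(x)), atol=1e-6))  # True
```

Local and full attention carry position-dependent weights and fail this check, which is precisely the inductive-bias difference the ablation in Fig. 2(b) measures.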
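The suppression factor $\Lambda$ in Fig. 3(c) is extracted from fits of $P_L \propto \Lambda^{-\lfloor (d+1)/2 \rfloor}$, which is a straight line in log space. A sketch of such a fit; the $(d, P_L)$ values are placeholders, not data from the paper:

```python
import numpy as np

d = np.array([5, 7, 9, 11])                  # code distances (placeholders)
P_L = np.array([1e-4, 1e-5, 1e-6, 1e-7])     # logical error rates (placeholders)

k = (d + 1) // 2                             # exponent floor((d + 1) / 2)
slope, _ = np.polyfit(k, np.log(P_L), 1)     # log P_L = -k * log(Lambda) + const
print(f"Lambda = {np.exp(-slope):.1f}")      # -> 10.0 for these placeholders
```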