
FPGA-based Distributed Union-Find Decoder for Surface Codes

Namitha Liyanage, Yue Wu, Siona Tagare, Lin Zhong

TL;DR

A distributed version of the Union-Find (UF) decoder is reported that exploits parallel computing resources for further speedup and has a sublinear average time complexity with regard to $d$, given $O(d^3)$ parallel computing resources.

Abstract

A fault-tolerant quantum computer must decode and correct errors faster than they appear to prevent an exponential slowdown due to error correction. The Union-Find (UF) decoder is promising, with an average time complexity slightly higher than $O(d^3)$. We report a distributed version of the UF decoder that exploits parallel computing resources for further speedup. Using an FPGA-based implementation, we empirically show that this distributed UF decoder has a sublinear average time complexity with regard to $d$, given $O(d^3)$ parallel computing resources. The decoding time per measurement round decreases as $d$ increases, which is a first for any quantum error decoder. The implementation employs a scalable architecture called Helios that organizes parallel computing resources into a hybrid tree-grid structure. Using a Xilinx VCU129 FPGA, we implement $d$ up to 21 with an average decoding time of 11.5 ns per measurement round under 0.1% phenomenological noise, and 23.7 ns for $d=17$ under equivalent circuit-level noise. This is significantly faster than any existing decoder implementation. Furthermore, we show that Helios can optimize for resource efficiency by decoding $d=51$ on a Xilinx VCU129 FPGA with an average latency of 544 ns per measurement round.
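The UF decoder's core bookkeeping is a union-find (disjoint-set) structure over decoding-graph vertices in which each cluster also tracks whether it contains an odd number of defects. The sketch below shows only that bookkeeping, assuming path halving and union by size (common textbook choices, not necessarily what the paper uses); cluster growth and peeling are omitted, and the class name `UnionFind` is hypothetical. It is not the distributed Helios implementation.

```python
class UnionFind:
    """Hypothetical sketch: disjoint sets over decoding-graph vertices with cluster parity."""

    def __init__(self, n, defects):
        self.parent = list(range(n))  # every vertex starts as its own cluster root
        self.size = [1] * n
        # odd[root] is True when the cluster rooted there holds an odd number of defects
        self.odd = [v in defects for v in range(n)]

    def find(self, v):
        # Path halving: point each visited vertex at its grandparent while walking up.
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        # Union by size: attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        # The merged cluster's parity is the XOR of the two parities.
        self.odd[ra] = self.odd[ra] != self.odd[rb]
        return ra


if __name__ == "__main__":
    uf = UnionFind(5, {0, 2})  # vertices 0 and 2 are defects
    uf.union(0, 1)             # cluster {0, 1} holds one defect: odd
    uf.union(1, 2)             # cluster {0, 1, 2} holds two defects: even
    print(uf.odd[uf.find(0)])  # False
```

In a UF decoder, only odd clusters keep growing; once every cluster is even, growth stops and each cluster can be corrected locally.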

Paper Structure (42 sections, 12 figures, 2 tables, 8 algorithms)

Figures (12)

  • Figure 1: (a) Rotated CSS surface code ($d=5$), a commonly used type of surface code. The white circles are data qubits, and the black circles are the Z-type and X-type ancillas. (b) and (c) Measurement circuits of the Z-type and X-type ancillas. Except for the ancillas on the border, each Z-type and X-type ancilla interacts with 4 adjacent data qubits.
  • Figure 2: (a) An example syndrome of Z stabilizers for a $d=5$ surface code with 5 rounds of measurements. The syndrome contains an isolated X error (round 1), an isolated measurement error (rounds 1 and 2), a chain of two X errors (round 3), and a chain containing X errors and measurement errors spanning multiple measurement rounds (rounds 3 and 4). (b) Phenomenological noise decoding graph for the syndrome in (a), with defect vertices marked in red. (c) Modification of the decoding graph from phenomenological noise to circuit-level noise, shown only for 8 adjacent vertices. Extra edges in the circuit-level noise decoding graph are shown in blue. The thick blue edge represents a hook error, and the others represent X errors spanning two measurement rounds.
  • Figure 3: Helios architecture for a $d=5$ surface code with 5 measurement rounds under the phenomenological noise model. Because a $d=5$ surface code has 12 Z-type ancilla qubits, Helios contains a 12×5 PE array. PE $n$ denotes the PE with $v.id=n$. Not all links from the controller to PEs, nor all $v.id$s, are shown. The architecture for circuit-level noise has additional links between PEs corresponding to the additional edges in the circuit-level noise decoding graph.
  • Figure 4: The bottom-left corner of the PE array shown in Figure 3. Only part of the logic and memory inside PE 1 is shown: growth (S3) is stored per edge, in the PE with the lower $id$. The grow logic (in brown) calculates the updated growth value. edge_busy (in green) is kept per adjacent PE and is used to calculate $v.busy$.
  • Figure 5: An example of how the FPGA implementation groups four nearby defect measurements into a single cluster in eight cycles. (a) Each defect measurement is mapped to a PE; initially, the four defect measurements have $v.id=1,2,3,4$, $v.cid = v.id$, and $v.st\_odd=v.odd=1$. (b) The first growth cycle results in fully grown edges between {1,3}, {1,4}, and {2,4}. (c) During merging, PEs 3 and 4 set their $v.cid$ to 1 and set their parents to 1 (shown with orange arrows). (d) In the next cycle, PE 1 calculates the parity of the subtree rooted at 1 (PEs 1, 3, 4), while PE 2 updates its $v.cid$ and parent. (e, f) This updates $v.st\_odd$ of the subtrees rooted at 4 and 1 over the next two cycles. Simultaneously, the root node (PE 1) updates the parity of the cluster ($v.odd = 0$). (g) $v.odd$ is propagated to all PEs in the cluster in two cycles, and the absence of any change in the 8th cycle tells the controller to advance the stage.
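In the phenomenological decoding graph of Figure 2(b), each vertex is one Z-type ancilla in one measurement round, and Figure 3 assigns one PE per vertex, giving a 12×5 array for $d=5$. The sketch below computes that vertex count and extracts defect vertices, assuming the usual convention that a defect marks an ancilla whose outcome differs from the previous round, with an implicit all-zero round before the first. The function names `num_z_ancillas` and `defect_vertices` are hypothetical.

```python
def num_z_ancillas(d):
    # A distance-d rotated surface code has d*d - 1 ancillas, half of them Z-type.
    return (d * d - 1) // 2


def defect_vertices(syndrome):
    """Return (round, ancilla) pairs whose outcome flips relative to the previous round.

    syndrome[r][a] is the measured bit of Z-ancilla a in round r (0-indexed);
    an all-zero round is assumed before round 0.
    """
    defects = []
    previous = [0] * len(syndrome[0])
    for r, current in enumerate(syndrome):
        for a, (prev_bit, bit) in enumerate(zip(previous, current)):
            if prev_bit != bit:
                defects.append((r, a))
        previous = current
    return defects


if __name__ == "__main__":
    d, rounds = 5, 5
    print(num_z_ancillas(d) * rounds)  # 60 vertices, one PE each
    rows = [[0] * num_z_ancillas(d) for _ in range(rounds)]
    rows[1][3] = 1              # measurement error: ancilla 3 misreads in round 1 only
    for r in (2, 3, 4):         # data error in round 2 flips ancillas 5 and 6 from then on
        rows[r][5] = rows[r][6] = 1
    print(defect_vertices(rows))  # [(1, 3), (2, 3), (2, 5), (2, 6)]
```

As in Figure 2, the isolated measurement error produces a pair of defects on the same ancilla in consecutive rounds (a temporal edge), while the data error produces a pair in the same round on neighboring ancillas (a spatial edge).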
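Figure 5 shows PEs converging on a shared cluster id by exchanging $v.cid$ across fully grown edges, with the controller advancing once a cycle produces no change. The cycle-level sketch below assumes that each PE adopts the minimum $v.cid$ among itself and its neighbors across fully grown edges every cycle. It omits the parent pointers and the pipelined subtree-parity updates ($v.st\_odd$) described in the caption, so its cycle count is not the eight cycles of the figure. The function `merge_clusters` is hypothetical.

```python
def merge_clusters(vertex_ids, grown_edges, defects):
    """Synchronously propagate minimum cluster ids across fully grown edges."""
    neighbours = {v: set() for v in vertex_ids}
    for u, v in grown_edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    cid = {v: v for v in vertex_ids}  # initially v.cid = v.id
    cycles = 0
    while True:
        cycles += 1
        # Every PE reads its neighbours' values from the previous cycle.
        new_cid = {v: min([cid[v]] + [cid[n] for n in neighbours[v]]) for v in vertex_ids}
        if new_cid == cid:  # no change this cycle: the stage can advance
            break
        cid = new_cid

    # Cluster parity (v.odd): True when the cluster holds an odd number of defects.
    odd = {}
    for v in vertex_ids:
        root = cid[v]
        odd[root] = odd.get(root, False) != (v in defects)
    return cid, odd, cycles


if __name__ == "__main__":
    cid, odd, cycles = merge_clusters([1, 2, 3, 4], [(1, 3), (1, 4), (2, 4)], {1, 2, 3, 4})
    print(cid)     # {1: 1, 2: 1, 3: 1, 4: 1}
    print(odd)     # {1: False}
    print(cycles)  # 3
```

As in panels (c) and (d), PEs 3 and 4 adopt cid 1 in the first merge cycle, PE 2 follows one cycle later, and an unchanged cycle signals convergence. The cluster of four defects ends up even.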