Low-latency machine learning FPGA accelerator for multi-qubit-state discrimination

Pradeep Kumar Gautam; Shantharam Kalipatnapu; Shankaranarayanan H; Ujjawal Singhal; Benjamin Lienhard; Vibhor Singh; Chetan Singh Thakur

TL;DR

The paper tackles ultra-low-latency, high-fidelity multi-qubit state discrimination for frequency-multiplexed readout on an FPGA platform. It develops an end-to-end neural-network (NN) discriminator trained with quantization-aware training and deployed through the FINN-R dataflow to an RFSoC ZCU111, achieving latencies below 50 ns for five qubits while preserving fidelity ($F_{\rm GM}\approx 0.90$). It demonstrates architectural strategies to maximize FPGA parallelism (input segmentation, Concat-free dataflow) and compares with SVM/matched-filter baselines, showing favorable latency, albeit with a larger parameter budget. The work provides a scalable, automated path to integrate NN-based readout into quantum control stacks, with relevance for improving QEC cycle rates and robustness to readout crosstalk.
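For readers unfamiliar with the $F_{\rm GM}$ figure of merit, the following is the usual definition of a geometric mean fidelity over $N$ qubits. This summary does not spell the formula out, so the expression and the per-qubit symbol $F_i$ are assumptions based on common usage in the multiplexed-readout literature.

```latex
F_{\rm GM} = \left( \prod_{i=1}^{N} F_i \right)^{1/N}, \qquad N = 5 .
```

Under this definition, five qubits that each reach a fidelity of $0.90$ give $F_{\rm GM} = 0.90$, and a single poorly read qubit pulls the figure down more strongly than it would an arithmetic mean.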

Abstract

Measuring a qubit state is a fundamental yet error-prone operation in quantum computing. These errors can arise from various sources, such as crosstalk, spontaneous state transitions, and excitations caused by the readout pulse. Here, we utilize an integrated approach to deploy neural networks onto field-programmable gate arrays (FPGAs). We demonstrate that implementing a fully connected neural network accelerator for multi-qubit readout is advantageous, balancing computational complexity with low latency requirements without significant loss in accuracy. The neural network is implemented by quantizing weights, activation functions, and inputs. The hardware accelerator performs frequency-multiplexed readout of five superconducting qubits in less than 50 ns on a radio frequency system on chip (RFSoC) ZCU111 FPGA, marking the advent of RFSoC-based low-latency multi-qubit readout using neural networks. These modules can be implemented and integrated into existing quantum control and readout platforms, making the RFSoC ZCU111 ready for experimental deployment.
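To make "quantizing weights, activation functions, and inputs" concrete, here is a minimal sketch of a symmetric uniform quantizer. It only illustrates what a fixed bit width does to a value. The paper trains with quantization-aware training, which this rounding step does not reproduce. The function name `quantize_uniform`, the `x_max` clipping range, and the sample values are all hypothetical.

```python
import numpy as np

def quantize_uniform(x, bits, x_max=1.0):
    """Symmetric uniform quantizer (illustrative assumption, not the paper's scheme).

    Returns the integer codes and the dequantized values they represent.
    """
    x = np.asarray(x, dtype=float)
    if bits == 1:
        # Binarization: keep only the sign of each value.
        codes = np.where(x >= 0, 1, -1)
        return codes, codes * x_max
    levels = 2 ** (bits - 1) - 1          # e.g. 7 positive levels for 4 bits
    scale = x_max / levels                # step size between adjacent codes
    codes = np.clip(np.round(x / scale), -levels, levels).astype(int)
    return codes, codes * scale

# Hypothetical readout samples, quantized to 4 bits and to 1 bit.
samples = np.array([0.0, 0.3, 1.0, -2.0])
codes4, values4 = quantize_uniform(samples, bits=4)
codes1, _ = quantize_uniform(samples, bits=1)
```

Values outside the assumed range saturate at the extreme codes, which is why the choice of clipping range matters as much as the bit width.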

Paper Structure (12 sections, 3 equations, 6 figures, 4 tables)

Introduction
Superconducting Qubit Readout
Low-Latency Neural Network State Discriminator
Neural Network Model Optimization and Quantization
FPGA Acceleration
Results and Discussion
Performance and Resource Utilization
Comparison
Conclusion
NN-Accelerator Design Methodology
FINN-R Flow
SVM-based Qubit-State Discriminator

Figures (6)

Figure 1: Block diagram of a superconducting quantum processor interfacing with an FPGA system for qubit control and readout. The FPGA is part of the ZCU111 RFSoC evaluation kit. The quantum processor, which contains five superconducting qubits, is housed in a dilution refrigerator (for more details on the quantum processor and experimental setup, consult Ref. lienhard2022deep). RF digital-to-analog converters (RF-DACs) process the control signals for the qubits. The readout signal combines all five readout tones and is transmitted through a single feedline. This signal is first amplified by an amplification chain that includes a traveling-wave parametric amplifier (TWPA) and is then digitized by an RF analog-to-digital converter (RF-ADC). The RF-ADC converts the frequency-multiplexed readout signal into in-phase ($I$) and quadrature ($Q$) components. A machine learning accelerator for qubit-state discrimination subsequently processes these components. Based on the inferred qubit states, a feedback signal may be generated to drive the subsequent control pulse generation logic.
Figure 2: Characteristics of qubit-specific single-shot readout traces. Panels (a-e) show the state discrimination of integrated single-shot traces for all five superconducting qubits, with red (blue) markers indicating the inferred ground (excited) state of each qubit. Panel (f) presents the cross-fidelity matrix, calculated using matched filters to infer the qubit states.
Figure 3: Effects of neural network (NN) architecture parameters on readout fidelity. (a) Fidelity for various NN architectures with input feature sizes of $1024$ and $512$. The horizontal axis represents the dimensions of the hidden layers. (b) Impact on geometric mean fidelity $F_{\rm GM}$ with varying input bit quantization sizes. The labels 'w#a#' indicate the quantization of weights and activations where # indicates the number of bits used for quantization. (c) Effects of mixed quantization of weights and activations on fidelity. (d) Impact of mixed quantization of weights and activations with binarized input. The blue (red) curve shows fidelity variation with activation (weight) bit width for weights quantized to a single bit. (e) Effect of model parameters and depth on readout fidelity. The input, weights, and activations are quantized to $4$, $2$, and $2$ bits, respectively. (f) Cross-fidelity matrix for the quantized NN. For panels (b), (c), (d), and (f), the NN architecture is $512 \times 64 \times 5$.
Figure 4: Quantized neural network (QNN) archetypes. (a) Software model of the QNN $\left(\left(512\times 8\right)\times 8\times 5\right)$ for achieving maximum parallelism. The first hidden layer consists of $8$ equal segments (Seg $1$ to Seg $8$), each containing eight nodes in the Linear layer, followed by batch normalization (BN) and rectified linear units (ReLU). The output of each segment ($1\times 8$) is concatenated using the Concat layer, resulting in a size of $1\times 64$. (b) Fully parallel hardware architecture of the QNN. Each segment of the model shown in panel (a) is implemented as a multi-vector threshold unit, which runs in parallel on hardware. A processing element, consisting of various computation blocks, is shown in the inset. (c) Piecewise layered QNN architecture $256\times 128\times 128\times 128\times 128\times 5$. Each layer processes a segment of the input signal along with the output from the previous layer. The input size for the first layer of the model is $1\times 256$, and subsequent layers receive an input size of $1\times 128$, composed of $1\times 64$ from the previous layer and $1\times 64$ from the input segment. The last section of the input signal is only fed to the last layer, meaning only the last layer contributes to the latency of the network.
Figure 5: Flowchart illustrating the end-to-end process of FINN-R. The double-bordered rectangles indicate modifications made to the default FINN-R flow.
...and 1 more figure
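The two architectures in the Figure 4 caption can be sketched in NumPy as a shape-level illustration. Several points are assumptions rather than facts from the paper: each first-layer segment is taken to see the full 512-sample input (the caption does not settle this), the piecewise layers are taken to have 64 outputs each, batch normalization is omitted, and the weights are random placeholders. The function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(v):
    return np.maximum(v, 0.0)

def segmented_layer(x, weight_blocks):
    """Figure 4(a)-style layer: evaluate each segment independently, then concatenate."""
    return np.concatenate([relu(w @ x) for w in weight_blocks])

def piecewise_forward(x, weights):
    """Figure 4(c)-style network: each later layer sees the previous
    64-wide output plus the next 64-sample slice of the input."""
    h = relu(weights[0] @ x[:256])            # first layer: 256 inputs
    last = len(weights) - 2
    for k, w in enumerate(weights[1:]):
        chunk = x[256 + 64 * k: 256 + 64 * (k + 1)]
        z = w @ np.concatenate([h, chunk])    # 64 + 64 = 128 inputs
        h = z if k == last else relu(z)       # no ReLU on the output layer
    return h

x = rng.standard_normal(512)                  # placeholder readout trace

# Segmented layer: a 64 x 512 weight matrix split into eight 8 x 512 blocks.
W = rng.standard_normal((64, 512))
blocks = np.split(W, 8, axis=0)
seg_out = segmented_layer(x, blocks)
full_out = relu(W @ x)                        # unsegmented reference

# Piecewise network: 256 -> 64, three times 128 -> 64, then 128 -> 5.
pw_weights = ([rng.standard_normal((64, 256))]
              + [rng.standard_normal((64, 128)) for _ in range(3)]
              + [rng.standard_normal((5, 128))])
logits = piecewise_forward(x, pw_weights)
```

Under the full-input assumption, the segmented layer computes exactly what the unsegmented layer does, so segmentation only changes how the work is scheduled on hardware. In the piecewise network, the last 64 input samples reach only the final layer, which matches the caption's point that only that layer contributes to latency once the trace ends.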
