SZKP: A Scalable Accelerator Architecture for Zero-Knowledge Proofs

Alhad Daftardar; Brandon Reagen; Siddharth Garg

TL;DR

SZKP addresses the heavy online proving cost of zkSNARKs with a full-chip ASIC accelerator that unifies Dense MSM, Sparse MSM, and NTT modules for Groth16 proof generation. The architecture uses scalable processing elements (PEs), memory banking, and constant-geometry NTTs to achieve high utilization and bandwidth efficiency, delivering 12–86× speedups over GPU-based designs and 3–12× over prior ASICs, while using about half the area in representative configurations. A comprehensive design-space exploration shows where to trade off area, throughput, and bandwidth, and demonstrates that full proofs can be accelerated on-chip with manageable power densities. The work suggests a practical path toward scalable, hardware-assisted ZKP deployment in cloud, privacy-preserving, and crypto workloads.

Abstract

Zero-Knowledge Proofs (ZKPs) are an emergent paradigm in verifiable computing. In applications like cloud computing, a client (called the verifier) can use ZKPs to verify that the service provider (called the prover) is in fact performing the correct computation based on a public input. A recently prominent variant of ZKPs is zkSNARKs, which generate succinct proofs that can be rapidly verified by the end user. However, proof generation itself is very time consuming per transaction. Two key primitives in proof generation are the Number Theoretic Transform (NTT) and Multi-scalar Multiplication (MSM). These primitives are prime candidates for hardware acceleration, and prior works have looked at GPU implementations and custom RTL. However, both algorithms involve complex dataflow patterns: standard NTTs have irregular memory accesses for butterfly computations from stage to stage, and MSMs using Pippenger's algorithm have data-dependent memory accesses for partial sum calculations. We present SZKP, a scalable accelerator framework that is the first ASIC to accelerate an entire proof on-chip by leveraging structured dataflows for both NTTs and MSMs. SZKP achieves conservative full-proof speedups of over 400$\times$, 3$\times$, and 12$\times$ over CPU, ASIC, and GPU implementations, respectively.
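
To make the stage-to-stage access pattern mentioned in the abstract concrete, here is a minimal sketch of a textbook radix-2 NTT. It is not SZKP's constant-geometry design; the modulus 17, the root of unity 2, and all function names are toy assumptions chosen for illustration. The sketch records the distance between the two operands of each butterfly, which changes at every stage.

```python
def naive_ntt(values, omega, modulus):
    """Reference O(n^2) transform: X[k] = sum_i values[i] * omega^(i*k) mod modulus."""
    n = len(values)
    return [
        sum(v * pow(omega, i * k, modulus) for i, v in enumerate(values)) % modulus
        for k in range(n)
    ]


def bit_reverse(values):
    """Permute a power-of-two-length list into bit-reversed index order."""
    n = len(values)
    bits = n.bit_length() - 1
    out = [0] * n
    for i, v in enumerate(values):
        rev = int(format(i, f"0{bits}b")[::-1], 2)
        out[rev] = v
    return out


def ntt(values, omega, modulus):
    """Iterative radix-2 NTT (toy sketch).

    Returns the transform and the butterfly operand distance used in each stage.
    Assumes len(values) is a power of two (at least 2) and omega is a primitive
    len(values)-th root of unity modulo `modulus`.
    """
    n = len(values)
    a = bit_reverse(values)
    strides = []
    half = 1
    while half < n:
        strides.append(half)  # partner distance for this stage
        step = pow(omega, n // (2 * half), modulus)  # twiddle generator for this stage
        for start in range(0, n, 2 * half):
            w = 1
            for j in range(half):
                u = a[start + j]
                v = a[start + j + half] * w % modulus
                a[start + j] = (u + v) % modulus
                a[start + j + half] = (u - v) % modulus
                w = w * step % modulus
        half *= 2
    return a, strides


# Toy parameters (assumptions): 2 has multiplicative order 8 modulo 17.
coeffs = [3, 1, 4, 1, 5, 9, 2, 6]
spectrum, strides = ntt(coeffs, omega=2, modulus=17)
print(spectrum, strides)
```

In this sketch the operand distance doubles from one stage to the next, so a fixed memory banking scheme sees a different access pattern at every stage. A constant-geometry formulation, which the TL;DR says SZKP uses, keeps that pattern identical across stages.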

Paper Structure (33 sections, 2 equations, 14 figures, 10 tables, 1 algorithm)

Introduction
Background
Zero Knowledge Proofs
zkSNARK Protocol Description
MSMs
NTTs and Polynomial Computation
Overall Dataflow of Groth16
The SZKP Architecture
Dense MSM Architecture
Pippenger's algorithm for MSMs
Single PE Design
Scaling to Multiple PEs
Supporting Sparse MSMs
Separate Vs. Shared Hardware
Sparse G2 MSM Optimizations
...and 18 more sections

Figures (14)

Figure 1: Groth16 dataflow. This protocol involves 7 (I)NTTs and 5 MSMs to construct a lightweight proof for the verifier. In our design, software provides us $A(s), B(s),$ and $C(s)$, as well as the witness vector $w$ and ECC points derived from the offline-generated proving key.
Figure 2: Pippenger's algorithm. This example demonstrates computation of window 1, with $P_x$ already computed for window 0. The final result involves doubling $P_y$ 3 times before adding it to $P_x$.
Figure 3: Pipeline architecture for a single Dense MSM. Buckets store point addresses. Queue selection policy can be RR, Max-r, or LQ. Writebacks always succeed. Each PE reads from one bank of scalars and points at a time, avoiding memory contention.
Figure 4: PADD utilization across varying window sizes for round robin (RR), longest queue (LQ) and Max-$r$. Max-8 and LQ consistently have more than 90% utilization.
Figure 5: Example Polynomial Computation Pipeline Schedule. This simple schedule assumes an $M \times M$ matrix and $M$ NTT PEs. Slots with solid borders represent an INTT phase, while dashed borders represent an NTT phase. Slots with cross hatches represent column-wise operations, while slots with solid fill represent row-wise operations. Slots with dots represent values that use operands stored on-chip instead of being prefetched. $\omega_I$ are INTT twiddles, $\omega_N$ are NTT twiddles, and X are generators.
...and 9 more figures
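
The bucket accumulation and window combination described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the SZKP datapath: integers under addition modulo a fixed number stand in for elliptic-curve points, and the 3-bit window, 6-bit scalars, and all names are assumptions chosen so that there are exactly two windows and the higher window's sum is doubled 3 times, as in the caption.

```python
GROUP_MODULUS = 1_000_003  # assumption: toy additive group standing in for curve points
IDENTITY = 0


def padd(p, q):
    """Toy 'point addition'; a real design would use an elliptic-curve PADD unit."""
    return (p + q) % GROUP_MODULUS


def window_sum(scalars, points, window, bits):
    """Accumulate points into buckets selected by one scalar window, then reduce."""
    mask = (1 << bits) - 1
    buckets = [IDENTITY] * (1 << bits)
    for k, pt in zip(scalars, points):
        digit = (k >> (window * bits)) & mask  # data-dependent bucket index
        if digit:
            buckets[digit] = padd(buckets[digit], pt)
    # Running-sum reduction: computes sum(digit * buckets[digit]) using additions only.
    running = IDENTITY
    total = IDENTITY
    for digit in range(mask, 0, -1):
        running = padd(running, buckets[digit])
        total = padd(total, running)
    return total


def pippenger_msm(scalars, points, scalar_bits=6, bits=3):
    """Combine window sums from most to least significant window."""
    num_windows = -(-scalar_bits // bits)  # ceiling division
    result = IDENTITY
    for window in reversed(range(num_windows)):
        for _ in range(bits):
            result = padd(result, result)  # one doubling per window bit
        result = padd(result, window_sum(scalars, points, window, bits))
    return result


def naive_msm(scalars, points):
    """Reference result: sum of scalar * point in the toy group."""
    return sum(k * pt for k, pt in zip(scalars, points)) % GROUP_MODULUS


scalars = [5, 12, 33, 63]
points = [7, 11, 13, 17]
print(pippenger_msm(scalars, points), naive_msm(scalars, points))
```

The bucket index in `window_sum` depends on the scalar bits, which is the data-dependent memory access the abstract refers to. With two 3-bit windows, the window 1 sum plays the role of $P_y$ and is doubled 3 times before the window 0 sum ($P_x$) is added.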
