SZKP: A Scalable Accelerator Architecture for Zero-Knowledge Proofs
Alhad Daftardar, Brandon Reagen, Siddharth Garg
TL;DR
SZKP addresses the heavy online proving cost of zkSNARKs with a full-chip ASIC accelerator that unifies Dense MSM, Sparse MSM, and NTT modules for Groth16 proof generation. The architecture leverages scalable processing elements (PEs), memory banking, and constant-geometry NTTs to achieve high utilization and bandwidth efficiency, delivering 12–86× speedups over GPU-based designs and 3–12× over prior ASICs, while using about half the area in representative configurations. A comprehensive design-space exploration shows where to trade off area, throughput, and bandwidth, and demonstrates that full proofs can be accelerated on-chip with manageable power densities. The work suggests a practical path toward scalable, hardware-assisted ZKP deployment in cloud, privacy-preserving, and crypto workloads.
Abstract
Zero-Knowledge Proofs (ZKPs) are an emergent paradigm in verifiable computing. In applications like cloud computing, a client (called the verifier) can use ZKPs to verify that the service provider (called the prover) is in fact performing the correct computation based on a public input. zkSNARKs, a recently prominent variant of ZKPs, generate succinct proofs that can be rapidly verified by the end user. However, proof generation itself is very time-consuming for each transaction. Two key primitives in proof generation are the Number Theoretic Transform (NTT) and Multi-scalar Multiplication (MSM). These primitives are prime candidates for hardware acceleration, and prior works have looked at GPU implementations and custom RTL. However, both algorithms involve complex dataflow patterns: standard NTTs have irregular memory accesses for butterfly computations from stage to stage, and MSMs using Pippenger's algorithm have data-dependent memory accesses for partial sum calculations. We present SZKP, a scalable accelerator framework that is the first ASIC to accelerate an entire proof on-chip by leveraging structured dataflows for both NTTs and MSMs. SZKP achieves conservative full-proof speedups of over 400$\times$, 3$\times$, and 12$\times$ over CPU, ASIC, and GPU implementations, respectively.
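To make the data-dependent memory accesses of Pippenger's algorithm concrete, the following is a minimal sketch of the bucket method, not the SZKP dataflow. It assumes a toy group (integers modulo a prime under addition) in place of elliptic-curve points, and the names `P`, `msm_naive`, `msm_pippenger`, `window_bits`, and `scalar_bits` are hypothetical choices made for illustration.

```python
# Toy stand-in: the "group" is the integers modulo P under addition, so
# scalar * point is ordinary modular multiplication. A real MSM operates on
# elliptic-curve points; only the bucket bookkeeping is illustrated here.
P = 2**61 - 1  # hypothetical modulus chosen for illustration


def msm_naive(scalars, points):
    """Reference result: sum of scalar_i * point_i in the toy group."""
    return sum(s * g for s, g in zip(scalars, points)) % P


def msm_pippenger(scalars, points, window_bits=4, scalar_bits=16):
    """Bucket-method MSM; scalars are assumed to fit in scalar_bits bits."""
    num_windows = (scalar_bits + window_bits - 1) // window_bits
    mask = (1 << window_bits) - 1
    result = 0
    # Process windows from most significant to least significant.
    for w in reversed(range(num_windows)):
        # Shift the accumulated result up by one window via repeated doubling.
        for _ in range(window_bits):
            result = (result + result) % P
        buckets = [0] * (1 << window_bits)
        for s, g in zip(scalars, points):
            idx = (s >> (w * window_bits)) & mask
            if idx:
                # Data-dependent access: the scalar's bits select the bucket.
                buckets[idx] = (buckets[idx] + g) % P
        # Running-sum trick: yields sum(j * buckets[j]) using additions only.
        running = 0
        window_sum = 0
        for j in range(len(buckets) - 1, 0, -1):
            running = (running + buckets[j]) % P
            window_sum = (window_sum + running) % P
        result = (result + window_sum) % P
    return result
```

The bucket index `idx` depends on the scalar values, so which memory location is updated cannot be known ahead of time; this is the access pattern the abstract describes as data-dependent.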
