Code Generation for Cryptographic Kernels using Multi-word Modular Arithmetic on GPU
Naifeng Zhang, Franz Franchetti
TL;DR
Large-integer arithmetic dominates the cost of FHE and ZKPs. This work introduces multi-word modular arithmetic (MoMA), a formal framework that decomposes large bit-width modular operations into arithmetic on native machine words. A rewrite-system-based code generator, implemented in SPIRAL and integrated with NTTX, automatically transforms computations on large data types into sequences of operations on smaller types and emits GPU kernels. In the evaluation, MoMA-based BLAS kernels outperform state-of-the-art multi-precision libraries by orders of magnitude, and MoMA-based NTTs achieve near-ASIC performance on commodity GPUs for bit-widths from 128 to 1024. By making larger bit-widths efficient on commodity hardware, the approach could inform the co-design of future cryptographic software and hardware.
Abstract
Fully homomorphic encryption (FHE) and zero-knowledge proofs (ZKPs) are emerging as solutions for data security in distributed environments. However, the widespread adoption of these cryptographic techniques is hindered by their significant computational overhead, which results primarily from core cryptographic operations that involve large integer arithmetic. This paper presents a formalization of multi-word modular arithmetic (MoMA), which breaks down large bit-width integer arithmetic into operations on machine words. We further develop a rewrite system that implements MoMA through recursive rewriting of data types, designed for compatibility with compiler infrastructures and code generators. We evaluate MoMA by generating cryptographic kernels, including basic linear algebra subprograms (BLAS) operations and the number theoretic transform (NTT), targeting various GPUs. Our MoMA-based BLAS operations outperform state-of-the-art multi-precision libraries by orders of magnitude, and MoMA-based NTTs achieve near-ASIC performance on commodity GPUs.
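To make the idea of breaking large bit-width arithmetic into machine-word operations concrete, the following is a minimal sketch of a 128-bit modular addition built only from 64-bit word operations. It is an illustration under stated assumptions, not the code MoMA generates: the function names (add_word, dw_addmod, and so on) are hypothetical, the (high, low) word-pair layout is an assumption, and Python integers are masked to emulate 64-bit machine words. Both operands are assumed to be already reduced modulo q.

```python
# Illustrative sketch only: the names and the (hi, lo) layout are assumptions,
# not taken from the paper. Python ints are masked to emulate 64-bit words.
WORD_BITS = 64
MASK = (1 << WORD_BITS) - 1


def add_word(a, b, carry_in=0):
    """Single-word add with carry: returns (sum mod 2^64, carry_out)."""
    s = a + b + carry_in
    return s & MASK, s >> WORD_BITS


def sub_word(a, b, borrow_in=0):
    """Single-word subtract with borrow: returns (diff mod 2^64, borrow_out)."""
    d = a - b - borrow_in
    return d & MASK, 1 if d < 0 else 0


def dw_add(a, b):
    """Double-word add: low words first, carry propagated into the high words."""
    lo, c = add_word(a[1], b[1])
    hi, c = add_word(a[0], b[0], c)
    return (hi, lo), c


def dw_sub(a, b):
    """Double-word subtract, wrapping modulo 2^128."""
    lo, br = sub_word(a[1], b[1])
    hi, br = sub_word(a[0], b[0], br)
    return (hi, lo), br


def dw_lt(a, b):
    """Double-word less-than, comparing high words before low words."""
    return a[0] < b[0] or (a[0] == b[0] and a[1] < b[1])


def dw_addmod(a, b, q):
    """(a + b) mod q for double-word a, b < q, using only word-level steps."""
    s, carry = dw_add(a, b)
    # One conditional subtraction suffices because a + b < 2q.
    if carry or not dw_lt(s, q):
        s, _ = dw_sub(s, q)
    return s


def to_dw(x):
    """Split a value below 2^128 into a (hi, lo) pair of 64-bit words."""
    return (x >> WORD_BITS) & MASK, x & MASK


def from_dw(d):
    """Recombine a (hi, lo) pair into one integer (used only for checking)."""
    return (d[0] << WORD_BITS) | d[1]


if __name__ == "__main__":
    q = (1 << 127) - 1
    a, b = q - 1, q - 2
    r = from_dw(dw_addmod(to_dw(a), to_dw(b), to_dw(q)))
    print(r == (a + b) % q)
```

The same pattern can in principle be applied recursively, treating each half of a wider value as a double-word in turn, which is the kind of data-type rewriting the abstract describes. Multiplication and modular reduction need more machinery than this sketch covers.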
