Code Generation for Cryptographic Kernels using Multi-word Modular Arithmetic on GPU

Naifeng Zhang, Franz Franchetti

TL;DR

This work targets the prohibitive cost of large-integer arithmetic in fully homomorphic encryption (FHE) and zero-knowledge proofs (ZKPs). It introduces multi-word modular arithmetic (MoMA), a formal framework that decomposes large bit-width operations into native machine-word arithmetic. A rewrite-system-based code generator, implemented in SPIRAL and integrated with NTTX, automatically transforms computations on large data types into sequences of operations on smaller types and emits GPU kernels. In the reported results, MoMA-based BLAS kernels outperform state-of-the-art multi-precision libraries by orders of magnitude, and MoMA-based NTTs achieve near-ASIC performance on commodity GPUs for input bit-widths from 128 to 1,024. Making larger bit-widths efficient on commodity hardware could inform future cryptographic software and hardware co-design.

Abstract

Fully homomorphic encryption (FHE) and zero-knowledge proofs (ZKPs) are emerging as solutions for data security in distributed environments. However, the widespread adoption of these encryption techniques is hindered by their significant computational overhead, primarily resulting from core cryptographic operations that involve large integer arithmetic. This paper presents a formalization of multi-word modular arithmetic (MoMA), which breaks down large bit-width integer arithmetic into operations on machine words. We further develop a rewrite system that implements MoMA through recursive rewriting of data types, designed for compatibility with compiler infrastructures and code generators. We evaluate MoMA by generating cryptographic kernels, including basic linear algebra subprogram (BLAS) operations and the number theoretic transform (NTT), targeting various GPUs. Our MoMA-based BLAS operations outperform state-of-the-art multi-precision libraries by orders of magnitude, and MoMA-based NTTs achieve near-ASIC performance on commodity GPUs.
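
To make the idea of recursively rewriting wide data types concrete, the following is a minimal sketch of a multi-word modular addition. It is not the paper's generated code or its rewrite rules. It assumes a 64-bit machine word, a power-of-two operand width, and inputs already reduced modulo `q`. All function names (`add_with_carry`, `sub_with_borrow`, `mod_add`) are hypothetical, and Python integers stand in for fixed-width machine words.

```python
WORD_BITS = 64  # assumed machine-word width (illustrative)


def add_with_carry(a, b, carry_in, bits):
    """Add two `bits`-wide values plus a carry; return (sum mod 2**bits, carry_out)."""
    if bits <= WORD_BITS:
        # Base case: one machine-word addition, emulated with a Python int.
        s = a + b + carry_in
        return s & ((1 << bits) - 1), s >> bits
    # Recursive case: rewrite one wide addition as two half-width additions.
    half = bits // 2
    mask = (1 << half) - 1
    lo, carry = add_with_carry(a & mask, b & mask, carry_in, half)
    hi, carry = add_with_carry(a >> half, b >> half, carry, half)
    return (hi << half) | lo, carry


def sub_with_borrow(a, b, borrow_in, bits):
    """Subtract b and a borrow from a; return (difference mod 2**bits, borrow_out)."""
    if bits <= WORD_BITS:
        # Base case: one machine-word subtraction with wraparound.
        d = a - b - borrow_in
        return d & ((1 << bits) - 1), 1 if d < 0 else 0
    half = bits // 2
    mask = (1 << half) - 1
    lo, borrow = sub_with_borrow(a & mask, b & mask, borrow_in, half)
    hi, borrow = sub_with_borrow(a >> half, b >> half, borrow, half)
    return (hi << half) | lo, borrow


def mod_add(a, b, q, bits):
    """Compute (a + b) mod q, assuming 0 <= a, b < q < 2**bits."""
    s, carry = add_with_carry(a, b, 0, bits)
    d, borrow = sub_with_borrow(s, q, 0, bits)
    # a + b >= q exactly when the addition overflowed or s - q did not borrow.
    return d if (carry or not borrow) else s


if __name__ == "__main__":
    q = (1 << 255) - 19  # example 255-bit modulus, chosen only for illustration
    a, b = q - 1, q - 2
    print(hex(mod_add(a, b, q, 256)))
```

Each recursion level halves the operand width, so a 256-bit addition becomes two 128-bit additions and then four 64-bit additions linked by carries. On real hardware the base case would map to a native add-with-carry instruction rather than the masking used here.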

Paper Structure (49 sections, 27 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Performance of 256-bit NTT on GPUs and ASIC (lower is better). On an NVIDIA GeForce RTX 4090, the MoMA-based NTT outperforms the state-of-the-art cryptographic acceleration library [inbasekar2024icicle] running on an NVIDIA H100 by 14× on average and achieves near-ASIC [zhou2024fully] performance.
  • Figure 2: Performance of BLAS operations with various input bit-widths on CPU and GPU.
  • Figure 3: Performance of NTT with various input bit-widths on CPUs, GPUs, and ASICs.
  • Figure 4: Performance of $2^{16}$-point NTT with input bit-widths ranging from 128 to 1,024 on CPUs, GPUs, and ASICs.
  • Figure 5: Sensitivity analyses on NTT runtime.