PaReNTT: Low-Latency Parallel Residue Number System and NTT-Based Long Polynomial Modular Multiplication for Homomorphic Encryption
Weihang Tan, Sin-Wei Chiu, Antian Wang, Yingjie Lao, Keshab K. Parhi
TL;DR
PaReNTT tackles the challenge of ultra-low-latency long polynomial multiplication for HE by marrying CRT-based residue arithmetic with a feed-forward, two-parallel NTT/iNTT architecture. It introduces special CRT-friendly primes q_i of the form q_i = 2^v − β_i with few signed power-of-two terms, enabling shift-add implementations that greatly reduce area and power while preserving speed. The framework includes optimized pre-processing (residual coefficient computation), parallel residual-domain computations, and an efficient inverse CRT post-processing that decomposes large multiplications into smaller modular reductions, achieving substantial latency and throughput gains over prior CRT-based designs. Experimental FPGA results on n = 4096 with 180-bit moduli demonstrate high clock rates (≈240 MHz), reduced area and power, and significant latency reductions, making PaReNTT well-suited for high-sample-rate HE workloads. The approach provides a scalable path to support larger word-lengths and more CRT factors, with potential applicability to BFV, BGV, and CKKS in hardware-software co-design contexts.
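The shift-add idea behind moduli of the form q = 2^v − β can be sketched in a few lines. Since 2^v ≡ β (mod q), the high bits of a product can be folded back into the low bits by multiplying them by β, and when β has only a few signed power-of-two terms that multiplication is just a handful of shifts and adds. The sketch below is an illustration under stated assumptions, not the paper's circuit. The function name `reduce_special`, the `(sign, shift)` encoding of β, and the example prime 7681 = 2^13 − (2^9 − 1) are chosen here for demonstration. A hardware design would use a fixed number of folding stages instead of a loop.

```python
def reduce_special(x, v, beta_terms):
    """Reduce a non-negative integer x modulo q = 2**v - beta.

    beta_terms is a hypothetical encoding of beta as signed powers of two:
    a list of (sign, shift) pairs with beta = sum(sign * 2**shift).
    Assumes 0 < beta <= 2**(v - 1).
    """
    beta = sum(sign * (1 << shift) for sign, shift in beta_terms)
    q = (1 << v) - beta
    mask = (1 << v) - 1
    # Fold the high part into the low part using 2**v ≡ beta (mod q).
    while x >> v:
        hi, lo = x >> v, x & mask
        # hi * beta computed with shifts and adds only.
        x = lo + sum(sign * (hi << shift) for sign, shift in beta_terms)
    # At this point x < 2**v = q + beta, so one subtraction suffices.
    if x >= q:
        x -= q
    return x


# Example: q = 7681 = 2**13 - (2**9 - 1), so beta = +2**9 - 2**0.
V, BETA_TERMS, Q = 13, [(1, 9), (-1, 0)], 7681
product = 1234 * 5678
print(reduce_special(product, V, BETA_TERMS) == product % Q)
```

Each folding step replaces a general multiplier with two shifted copies of the high word, which is where the area and power savings described above would come from.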
Abstract
High-speed long polynomial multiplication is important for applications in homomorphic encryption (HE) and lattice-based cryptosystems. This paper addresses low-latency hardware architectures for long polynomial modular multiplication using the number-theoretic transform (NTT) and inverse NTT (iNTT). The Chinese remainder theorem (CRT) is used to decompose the modulus into multiple smaller moduli. Our proposed architecture, namely PaReNTT, makes four novel contributions. First, parallel NTT and iNTT architectures are proposed to reduce the number of clock cycles needed to process the polynomials. This can enable real-time processing for HE applications, as the number of clock cycles to process a polynomial is inversely proportional to the level of parallelism. Second, the proposed architecture eliminates the need to permute the NTT outputs before their product is input to the iNTT. This reduces latency by n/4 clock cycles, where n is the length of the polynomial, and reduces the buffer requirement by one delay-switch-delay circuit of size n. Third, an approach to select special moduli is presented, where the moduli can be expressed in terms of a few signed power-of-two terms. Fourth, novel architectures are proposed for pre-processing, which computes the residual polynomials using the CRT, and for post-processing, which combines the residual polynomials. These architectures significantly reduce the area consumption of the pre-processing and post-processing steps. The proposed long polynomial modular multiplication architectures are ideal for applications that require low latency and a high sample rate, as these feed-forward architectures can be pipelined at arbitrary levels.
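The overall data flow described in the abstract (CRT pre-processing, independent NTT-based multiplication per residue modulus, CRT post-processing) can be illustrated with a small software model. This is a minimal sketch under several assumptions that do not come from the text above: the ring is taken to be Z_Q[x]/(x^n + 1), as is common in HE schemes; the moduli 7681 and 12289 and the length n = 8 are toy values; the transform is a direct O(n^2) evaluation rather than a butterfly network; and all function names are hypothetical. It does not model the two-parallel datapath or the permutation-free NTT/iNTT connection. It requires Python 3.8 or later for `math.prod` and `pow(x, -1, q)`.

```python
from math import prod


def find_psi(n, q):
    """Find a primitive 2n-th root of unity mod prime q (n a power of two, q ≡ 1 mod 2n)."""
    for g in range(2, q):
        psi = pow(g, (q - 1) // (2 * n), q)
        if pow(psi, n, q) == q - 1:  # psi**n == -1 implies order exactly 2n
            return psi
    raise ValueError("no primitive 2n-th root of unity found")


def negacyclic_mul_ntt(a, b, q):
    """Multiply a and b in Z_q[x]/(x^n + 1) via a direct (O(n^2)) negacyclic NTT."""
    n = len(a)
    psi = find_psi(n, q)
    psi_inv, n_inv = pow(psi, -1, q), pow(n, -1, q)

    def forward(p):
        # Evaluate p at the odd powers of psi, which are the roots of x^n + 1.
        return [sum(p[j] * pow(psi, (2 * k + 1) * j, q) for j in range(n)) % q
                for k in range(n)]

    c_hat = [x * y % q for x, y in zip(forward(a), forward(b))]
    # Inverse transform: interpolate back to coefficient form.
    return [n_inv * sum(c_hat[k] * pow(psi_inv, (2 * k + 1) * j, q) for k in range(n)) % q
            for j in range(n)]


def negacyclic_mul_schoolbook(a, b, big_q):
    """Reference multiplication in Z_Q[x]/(x^n + 1), using x^n = -1."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % big_q
            else:
                c[k - n] = (c[k - n] - a[i] * b[j]) % big_q
    return c


def crt_poly_mul(a, b, moduli):
    """Pre-process into residues, multiply per modulus, then recombine with the inverse CRT."""
    big_q = prod(moduli)
    residues = [negacyclic_mul_ntt([x % q for x in a], [x % q for x in b], q)
                for q in moduli]
    out = []
    for coeffs in zip(*residues):
        acc = 0
        for r, q in zip(coeffs, moduli):
            q_i = big_q // q
            acc += r * q_i * pow(q_i, -1, q)
        out.append(acc % big_q)
    return out


MODULI = [7681, 12289]  # toy NTT-friendly primes, not the paper's moduli
A = [3, 1, 4, 1, 5, 9, 2, 6]
B = [94000000, 27, 18, 28, 18, 28, 45, 90]
print(crt_poly_mul(A, B, MODULI) == negacyclic_mul_schoolbook(A, B, prod(MODULI)))
```

Because each residue channel only ever sees word-sized coefficients, the per-modulus multiplications are independent and can run side by side, which is the property the parallel architecture exploits.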
