ALLMod: Exploring $\underline{\mathbf{A}}$rea-Efficiency of $\underline{\mathbf{L}}$UT-based $\underline{\mathbf{L}}$arge Number $\underline{\mathbf{Mod}}$ular Reduction via Hybrid Workloads
Fangxin Liu, Haomin Li, Zongwu Wang, Bo Zhang, Mingzhe Zhang, Shoumeng Yan, Li Jiang, Haibing Guan
TL;DR
This paper tackles the high hardware cost of LUT-based large-number modular reduction by introducing ALLMod, a hybrid approach that fuses LUT-based and iterative methods into a balanced-workload template. Through formal workload splitting, a templated design, and an automatic design-space search, ALLMod achieves substantial area-efficiency gains while preserving throughput, including up to a $3\times$ improvement at $n=8{,}192$ and notable reductions in BRAM and adder/subtractor usage. The method yields Pareto-optimal designs under user-specified latency and area constraints and is validated on an FPGA prototype running at $200$ MHz. The work is relevant to privacy-centric cryptographic accelerators (HE, ZKP), where it enables scalable, area-efficient large-number modular reduction on reconfigurable hardware.
Abstract
Modular arithmetic, particularly modular reduction, is widely used in cryptographic applications such as homomorphic encryption (HE) and zero-knowledge proofs (ZKP). High-bit-width operations are crucial for enhancing security; however, they are computationally intensive due to the large number of modular operations required. The lookup-table-based (LUT-based) approach, a ``space-for-time'' technique, reduces computational load by segmenting the input number into smaller bit groups, pre-computing modular reduction results for each segment, and storing these results in LUTs. While effective, this method incurs significant hardware overhead due to extensive LUT usage. In this paper, we introduce ALLMod, a novel approach that improves the area efficiency of LUT-based large-number modular reduction by employing hybrid workloads. Inspired by the iterative method, ALLMod splits the bit groups into two distinct workloads, achieving lower area costs without compromising throughput. We first develop a template to facilitate workload splitting and ensure balanced distribution. Then, we conduct design space exploration to evaluate the optimal timing for fusing workload results, enabling us to identify the most efficient design under specific constraints. Extensive evaluations show that ALLMod achieves up to $1.65\times$ and $3\times$ improvements in area efficiency over conventional LUT-based methods for bit-widths of $128$ and $8{,}192$, respectively.
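The conventional LUT-based scheme described above can be illustrated with a minimal software sketch. It models only the baseline idea (segment the input, look up each segment's pre-computed residue, sum, and apply a final correction), not ALLMod's hybrid workload split or any hardware structure. The function names `build_luts` and `lut_mod`, the group width `g`, and the example modulus are assumptions chosen for illustration.

```python
def build_luts(q, n_bits, g):
    """Pre-compute one table per g-bit group (hypothetical helper).

    Entry v of table i holds (v * 2**(g*i)) mod q, i.e. the residue that
    group i contributes when its bits equal v.
    """
    n_groups = -(-n_bits // g)  # ceiling division
    return [[(v << (g * i)) % q for v in range(1 << g)]
            for i in range(n_groups)]


def lut_mod(x, q, luts, g):
    """Reduce x modulo q using only table lookups, additions, subtractions."""
    if x < 0 or x.bit_length() > g * len(luts):
        raise ValueError("x does not fit in the bit-width the tables cover")
    mask = (1 << g) - 1
    acc = 0
    for i, table in enumerate(luts):
        # Select the i-th g-bit group of x and add its pre-computed residue.
        acc += table[(x >> (g * i)) & mask]
    # Each table entry is < q, so acc < len(luts) * q; a bounded number of
    # conditional subtractions brings it into [0, q).
    while acc >= q:
        acc -= q
    return acc


# Example with an assumed 61-bit modulus, a 128-bit input, and 4-bit groups.
q = (1 << 61) - 1
luts = build_luts(q, n_bits=128, g=4)
x = 0x0123456789ABCDEF0FEDCBA987654321
assert lut_mod(x, q, luts, g=4) == x % q
```

In this sketch the storage is `ceil(n_bits / g)` tables of `2**g` entries each, which makes the area concern concrete: the number of tables grows linearly with the input bit-width, so wide inputs such as $n=8{,}192$ require many tables.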
