ALLMod: Exploring $\underline{\mathbf{A}}$rea-Efficiency of $\underline{\mathbf{L}}$UT-based $\underline{\mathbf{L}}$arge Number $\underline{\mathbf{Mod}}$ular Reduction via Hybrid Workloads

Fangxin Liu, Haomin Li, Zongwu Wang, Bo Zhang, Mingzhe Zhang, Shoumeng Yan, Li Jiang, Haibing Guan

TL;DR

This paper tackles the high hardware cost of LUT-based large-number modular reduction by introducing ALLMod, a hybrid approach that combines LUT-based and iterative methods in a balanced-workload template. Through formal workload splitting, a templated design, and an automatic design-space search, ALLMod improves area efficiency while preserving throughput, with gains of up to $3\times$ at $n=8{,}192$ and notable reductions in BRAM and adder/subtractor usage. The method yields Pareto-optimal designs under user-specified latency and area constraints and is validated on an FPGA prototype at $200$ MHz. The work is relevant to privacy-centric cryptographic accelerators (HE, ZKP), where it enables scalable, area-efficient large-number modular reduction on reconfigurable hardware.

Abstract

Modular arithmetic, particularly modular reduction, is widely used in cryptographic applications such as homomorphic encryption (HE) and zero-knowledge proofs (ZKP). High-bit-width operations are crucial for enhancing security; however, they are computationally intensive due to the large number of modular operations required. The lookup-table-based (LUT-based) approach, a "space-for-time" technique, reduces computational load by segmenting the input number into smaller bit groups, pre-computing modular reduction results for each segment, and storing these results in LUTs. While effective, this method incurs significant hardware overhead due to extensive LUT usage. In this paper, we introduce ALLMod, a novel approach that improves the area efficiency of LUT-based large-number modular reduction by employing hybrid workloads. Inspired by the iterative method, ALLMod splits the bit groups into two distinct workloads, achieving lower area costs without compromising throughput. We first develop a template to facilitate workload splitting and ensure balanced distribution. Then, we conduct design space exploration to evaluate the optimal timing for fusing workload results, enabling us to identify the most efficient design under specific constraints. Extensive evaluations show that ALLMod achieves up to $1.65\times$ and $3\times$ improvements in area efficiency over conventional LUT-based methods for bit-widths of $128$ and $8{,}192$, respectively.
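The LUT-based and iterative approaches described in the abstract can be sketched in software as follows. This is a minimal illustration of the arithmetic only, not the paper's hardware template: the function names, the segment width `k`, and the split point `split_seg` are hypothetical, the iterative variant is assumed to be a plain shift-and-subtract loop, and `hybrid_mod` only demonstrates that two partial results from different workloads can be fused by addition and a final correction.

```python
def build_luts(q, n_bits, k):
    """Precompute luts[i][v] = (v << (k*i)) mod q for every k-bit segment value v."""
    num_segments = (n_bits + k - 1) // k
    return [[(v << (k * i)) % q for v in range(1 << k)]
            for i in range(num_segments)]


def lut_mod(x, q, luts, k):
    """LUT-based reduction: look up each segment's residue and accumulate."""
    mask = (1 << k) - 1
    acc = sum(table[(x >> (k * i)) & mask] for i, table in enumerate(luts))
    # acc is below len(luts) * q, so a few subtractions finish the reduction.
    while acc >= q:
        acc -= q
    return acc


def iterative_mod(x, q, n_bits):
    """Shift-and-subtract reduction (assumed variant): one conditional
    subtraction per input bit, most significant bit first."""
    r = 0
    for i in reversed(range(n_bits)):
        r = (r << 1) | ((x >> i) & 1)
        if r >= q:
            r -= q
    return r


def hybrid_mod(x, q, luts, k, split_seg):
    """Illustrative split (hypothetical): the low split_seg segments go to the
    iterative path, the remaining segments to the LUT path, then fuse."""
    low_bits = k * split_seg
    low = iterative_mod(x & ((1 << low_bits) - 1), q, low_bits)
    mask = (1 << k) - 1
    high = sum(luts[i][(x >> (k * i)) & mask]
               for i in range(split_seg, len(luts)))
    acc = low + high
    while acc >= q:
        acc -= q
    return acc
```

With a small modulus such as `q = 65521`, `n_bits = 64`, and `k = 4`, all three functions agree with Python's built-in `x % q` for any `x` below `2**64`. Moving `split_seg` trades table entries for serial subtractions, which is the kind of area/latency trade-off the paper explores in hardware.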

Paper Structure

This paper contains 20 sections, 6 equations, 5 figures, 1 table, 2 algorithms.

Figures (5)

  • Figure 1: Comparison of ALLMod to the existing modular reduction methods and ALLMod design overview.
  • Figure 2: Comparison of area efficiency and latency between the LUT-based method and the iterative-based method.
  • Figure 3: ALLMod template for balanced workloads. Parts ① and ② support lookup and accumulation for the LUT-based method. Part ③ supports serial subtraction for the iterative-based method. Parts ④ and ⑤ are designed for fusing and adjusting the results.
  • Figure 4: Pareto optimal schemes of ALLMod for various throughputs and bit-widths. The optimal schemes with high area efficiency gradually converge as the bit-width increases.
  • Figure 5: Visualization of searched schemes for various bit-widths at maximum throughput $MaxTP$. Red points indicate the feasible schemes and green points indicate the Pareto optimal schemes.
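
The distinction between feasible and Pareto optimal schemes in Figures 4 and 5 can be illustrated with a small dominance filter. This is a generic sketch, not the paper's search procedure: each scheme is assumed to be an `(area, latency)` pair where lower is better on both axes, and the function name and sample values are hypothetical.

```python
def pareto_front(schemes):
    """Return the schemes that no other scheme dominates.

    A scheme dominates another if it is no worse on both axes and
    strictly better on at least one (lower is better for both).
    """
    front = []
    for i, (area_i, lat_i) in enumerate(schemes):
        dominated = any(
            area_j <= area_i and lat_j <= lat_i
            and (area_j < area_i or lat_j < lat_i)
            for j, (area_j, lat_j) in enumerate(schemes)
            if j != i
        )
        if not dominated:
            front.append((area_i, lat_i))
    return front


# Hypothetical (area, latency) points, for illustration only.
feasible = [(10, 5), (8, 7), (12, 4), (9, 9), (10, 6)]
optimal = pareto_front(feasible)
```

Here `(9, 9)` is dominated by `(8, 7)` and `(10, 6)` by `(10, 5)`, so only the other three points remain on the front.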