LaMoS: Enabling Efficient Large Number Modular Multiplication through SRAM-based CiM Acceleration
Haomin Li, Fangxin Liu, Chenyang Guan, Zongwu Wang, Li Jiang, Haibing Guan
TL;DR
LaMoS tackles the challenge of efficient large-number modular multiplication for privacy-preserving cryptography by rethinking SRAM-based computing-in-memory with MAC-capable macros and Barrett's algorithm. The method segments large-bitwidth multiplications into 8-bit workloads, maps them across multiple CiM macros, and uses a workload-grouping optimization to scale to higher bit-widths with reduced idle cycles. Empirical results show substantial gains over prior SRAM-based CiM designs, including up to a 3× improvement in latency×area and strong scaling up to 2048-bit operands, while maintaining practical area and latency. The work enables near-memory acceleration of ECC/RSA-based privacy applications, reducing data movement and latency in large-number modular arithmetic.
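As a rough illustration of the segmentation idea above, the following Python sketch splits two large operands into 8-bit limbs and builds the full product from 8-bit by 8-bit partial products accumulated per column, which is the kind of multiply-accumulate workload a MAC-capable macro could take on. The function names (`to_limbs`, `segmented_multiply`) and the column-wise accumulation order are assumptions made for illustration; this is not the paper's actual mapping or dataflow.

```python
def to_limbs(x, n_limbs, limb_bits=8):
    """Split a non-negative integer into n_limbs little-endian limbs."""
    mask = (1 << limb_bits) - 1
    return [(x >> (limb_bits * i)) & mask for i in range(n_limbs)]


def segmented_multiply(a, b, bitwidth, limb_bits=8):
    """Multiply two integers below 2**bitwidth using only limb-sized products.

    Hypothetical helper: each ai * bj below stands in for one 8-bit
    workload; the hardware mapping in the paper may differ.
    """
    n = -(-bitwidth // limb_bits)  # ceiling division: number of limbs
    a_limbs = to_limbs(a, n, limb_bits)
    b_limbs = to_limbs(b, n, limb_bits)

    # columns[k] accumulates every partial product with i + j == k,
    # mimicking a multiply-accumulate over one output column.
    columns = [0] * (2 * n - 1)
    for i, ai in enumerate(a_limbs):
        for j, bj in enumerate(b_limbs):
            columns[i + j] += ai * bj

    # Recombine the columns by shifting each to its limb position.
    return sum(c << (limb_bits * k) for k, c in enumerate(columns))
```

For a 256-bit operand pair this produces 32 × 32 = 1024 limb products, which gives a sense of why the workload grows quickly with bit-width and why grouping those products matters at 2048 bits.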
Abstract
Barrett's algorithm is one of the most widely used methods for performing modular multiplication, a critical nonlinear operation in modern privacy computing techniques such as homomorphic encryption (HE) and zero-knowledge proofs (ZKP). Since modular multiplication dominates the processing time in these applications, computational complexity and memory limitations significantly impact performance. Computing-in-Memory (CiM) is a promising approach to tackle this problem. However, existing schemes suffer from two main problems: 1) Most works focus on low bit-width modular multiplication, which is inadequate for mainstream cryptographic algorithms such as elliptic curve cryptography (ECC) and the RSA algorithm, both of which require high bit-width operations; 2) Recent efforts targeting large-number modular multiplication rely on inefficient in-memory logic operations, resulting in high scaling costs for larger bit-widths and increased latency. To address these issues, we propose LaMoS, an efficient SRAM-based CiM design for large-number modular multiplication, offering high scalability and area efficiency. First, we analyze Barrett's modular multiplication method and map the workload onto SRAM CiM macros for high bit-width cases. Additionally, we develop an efficient CiM architecture and dataflow to optimize large-number modular multiplication. Finally, we refine the mapping scheme for better scalability in high bit-width scenarios using workload grouping. Experimental results show that LaMoS achieves a $7.02\times$ speedup and reduces high bit-width scaling costs compared to existing SRAM-based CiM designs.
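For readers unfamiliar with Barrett's method, the sketch below shows the textbook form of Barrett modular multiplication in Python: the division by the modulus is replaced by a multiplication with a precomputed constant, a shift, and a short correction loop. The names `barrett_precompute` and `barrett_mulmod` and the choice of shift amount are assumptions for illustration; the variant analyzed in the paper may use different parameters.

```python
def barrett_precompute(modulus):
    """Return (k, mu) for Barrett reduction, with mu = floor(2**(2k) / modulus)."""
    if modulus < 2:
        raise ValueError("modulus must be at least 2")
    k = modulus.bit_length()
    mu = (1 << (2 * k)) // modulus  # one-time division, reused for every product
    return k, mu


def barrett_mulmod(a, b, modulus, k, mu):
    """Compute (a * b) mod modulus, assuming 0 <= a, b < modulus."""
    z = a * b                    # full-width product, below 2**(2k)
    q = (z * mu) >> (2 * k)      # estimate of floor(z / modulus), never too large
    r = z - q * modulus          # candidate remainder, non-negative
    while r >= modulus:          # the estimate is off by a small amount at most
        r -= modulus
    return r


# Usage: precompute once per modulus, then reuse for many multiplications.
m = 2**255 - 19
k, mu = barrett_precompute(m)
product = barrett_mulmod(12345678901234567890, 98765432109876543210, m, k, mu)
```

The reduction consists of three large multiplications (`a * b`, `z * mu`, `q * modulus`) plus shifts and subtractions, so an accelerator for large-number multiplication covers most of the work.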
