LaMoS: Enabling Efficient Large Number Modular Multiplication through SRAM-based CiM Acceleration

Haomin Li, Fangxin Liu, Chenyang Guan, Zongwu Wang, Li Jiang, Haibing Guan

TL;DR

LaMoS tackles the challenge of efficient large-number modular multiplication for privacy-preserving cryptography by rethinking SRAM-based computing-in-memory with MAC-capable macros and Barrett's algorithm. The method segments large-bitwidth multiplications into 8-bit workloads, maps them across multiple CiM macros, and uses a workload-grouping optimization to scale to higher bit-widths with reduced idle cycles. Empirical results show substantial gains over prior SRAM-based CiM designs, including up to a 3× improvement in latency×area and strong scaling up to 2048-bit operands, while maintaining practical area and latency. The work enables near-memory acceleration of ECC/RSA-based privacy applications, reducing data movement and latency in large-number modular arithmetic.
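
The segmentation idea can be illustrated with a minimal software sketch. It only checks the arithmetic identity behind splitting a wide multiplication into 8-bit pieces and summing the partial products by column, which is the kind of work a MAC-capable macro performs. It is not the paper's mapping or dataflow; the function names `to_limbs` and `segmented_multiply` and the parameters `n_bits` and `limb_bits` are hypothetical and chosen for illustration.

```python
def to_limbs(x: int, n_bits: int, limb_bits: int = 8) -> list[int]:
    """Split a non-negative integer into little-endian limbs of limb_bits each."""
    assert n_bits % limb_bits == 0, "operand width must be a multiple of the limb width"
    assert 0 <= x < (1 << n_bits), "operand does not fit in n_bits"
    mask = (1 << limb_bits) - 1
    return [(x >> (i * limb_bits)) & mask for i in range(n_bits // limb_bits)]


def segmented_multiply(a: int, b: int, n_bits: int = 256, limb_bits: int = 8) -> int:
    """Multiply two n_bits-wide integers using only limb-by-limb products.

    Assumption: this models the arithmetic only, not how LaMoS places
    limbs in SRAM or schedules them across macros.
    """
    a_limbs = to_limbs(a, n_bits, limb_bits)
    b_limbs = to_limbs(b, n_bits, limb_bits)

    # Column k accumulates every product a_i * b_j with i + j == k,
    # i.e. one multiply-accumulate sum per output column.
    columns = [0] * (len(a_limbs) + len(b_limbs) - 1)
    for i, a_i in enumerate(a_limbs):
        for j, b_j in enumerate(b_limbs):
            columns[i + j] += a_i * b_j

    # Shift each column sum to its weight and add; carries resolve here.
    return sum(c << (k * limb_bits) for k, c in enumerate(columns))
```

For 256-bit operands this decomposition yields 32 limbs per operand and 32 × 32 = 1024 8-bit partial products, which is why the way those products are distributed across macros matters for latency.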

Abstract

Barrett's algorithm is one of the most widely used methods for performing modular multiplication, a critical nonlinear operation in modern privacy computing techniques such as homomorphic encryption (HE) and zero-knowledge proofs (ZKP). Since modular multiplication dominates the processing time in these applications, computational complexity and memory limitations significantly impact performance. Computing-in-Memory (CiM) is a promising approach to tackle this problem. However, existing schemes suffer from two main problems: 1) most works focus on low bit-width modular multiplication, which is inadequate for mainstream cryptographic algorithms such as elliptic curve cryptography (ECC) and the RSA algorithm, both of which require high bit-width operations; 2) recent efforts targeting large-number modular multiplication rely on inefficient in-memory logic operations, resulting in high scaling costs for larger bit-widths and increased latency. To address these issues, we propose LaMoS, an efficient SRAM-based CiM design for large-number modular multiplication, offering high scalability and area efficiency. First, we analyze Barrett's modular multiplication method and map the workload onto SRAM CiM macros for high bit-width cases. Additionally, we develop an efficient CiM architecture and dataflow to optimize large-number modular multiplication. Finally, we refine the mapping scheme for better scalability in high bit-width scenarios using workload grouping. Experimental results show that LaMoS achieves a $7.02\times$ speedup and reduces high bit-width scaling costs compared to existing SRAM-based CiM designs.
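
For readers unfamiliar with Barrett's method, the following is a minimal sketch of the textbook form of Barrett modular multiplication, written with Python's arbitrary-precision integers. It is a reference model under stated assumptions, not the LaMoS hardware dataflow: the function names `barrett_precompute` and `barrett_mulmod`, and the choice of `k` as the bit length of the modulus, are illustrative and not taken from the paper.

```python
def barrett_precompute(m: int) -> tuple[int, int]:
    """Precompute the Barrett constant for modulus m.

    Assumption: k is the bit length of m, so m < 2**k, and
    mu = floor(2**(2k) / m). This is computed once per modulus.
    """
    assert m > 1
    k = m.bit_length()
    mu = (1 << (2 * k)) // m
    return k, mu


def barrett_mulmod(a: int, b: int, m: int, k: int, mu: int) -> int:
    """Return (a * b) mod m without dividing by m at run time."""
    assert 0 <= a < m and 0 <= b < m
    x = a * b                      # full product, x < m**2 < 2**(2k)
    q = (x * mu) >> (2 * k)        # estimate of floor(x / m), never too large
    r = x - q * m                  # r is congruent to x mod m and non-negative
    # The estimate q can be slightly low, so a few subtractions may remain.
    while r >= m:
        r -= m
    return r
```

The point of the method is visible in the body of `barrett_mulmod`: after the one-time precomputation, the reduction uses only multiplications, a shift, and subtractions, so the cost is dominated by wide multiplications of the kind the abstract proposes to map onto SRAM CiM macros.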

Paper Structure

This paper contains 16 sections, 3 equations, 9 figures, 1 table, 1 algorithm.

Figures (9)

  • Figure 1: Performance Comparison with previous works over various bit-widths.
  • Figure 2: Large Number Multiplication Workload Mapping onto the SRAM Macro.
  • Figure 3: Input Timing Flow for 256-bit multiplication with single/multiple SRAM macro(s). Number of SRAM macros is set to 3 in the example.
  • Figure 4: LaMoS Architecture and Dataflow for efficient large number modular multiplication.
  • Figure 5: LaMoS Execution for high bit-width modular multiplication. (a) Naively Scaling. (b) Optimization with workload grouping.
  • ...and 4 more figures