Efficient and Flexible Different-Radix Montgomery Modular Multiplication for Hardware Implementation

Yuxuan Zhang; Hua Guo; Chen Chen; Yewei Guan; Xiyong Zhang; Zhenyu Guan

TL;DR

This work addresses the bottleneck of Montgomery modular multiplication (MMM) for large moduli in public-key cryptography by introducing Different-Radix MMM (DRMMM), which decouples the radix used for the intermediate products $a_iB$ from the radix used for the quotient terms $\hat{q}$, enabling cross-iteration pipelining of the quotient computation. The authors develop a concrete cross-iteration pipeline, prove the correctness of the different-radix approach, and analyze latency to show a shorter critical path when the number of iterations is large. They also design a hardware architecture optimized for FPGA, including a LUT-based precomputed encoding layer and an advanced LUT-based 6-to-3 compressor, to realize high-throughput MMM with a triplet intermediate representation $\hat{Z}=(\hat{Z}^0,\hat{Z}^1,\hat{Z}^2)$. Empirical results on a Virtex-7 FPGA show up to a 38.3% reduction in output latency and a 34.04% reduction in area-time product (ATP) compared with state-of-the-art designs, supporting DRMMM's effectiveness for large-modulus MMM in hardware.

Abstract

Montgomery modular multiplication is widely used in public-key cryptosystems (PKC) and directly affects the efficiency of the systems built on top of it. However, moduli are growing larger to meet increasing security demands, which results in a heavy computational cost. High-performance implementations of Montgomery modular multiplication are therefore urgently required to keep PKC operations efficient. Existing high-speed implementations still need a large amount of redundant computation to simplify the intermediate result, and support for redundant representations in Montgomery modular multiplication is extremely limited. In this paper, we propose an efficient parallel variant of iterative Montgomery modular multiplication, called DRMMM, that allows the quotient to be computed across multiple iterations. In this variant, the terms of the intermediate result and the quotient in each iteration are computed in different radices, so that the computation of the quotient can be pipelined. Based on the proposed variant, we also design a high-performance hardware architecture for faster operation. In this architecture, the intermediate result in every iteration is represented as three parts, which avoids redundant computation. Finally, to support FPGA-based systems, we design operators based on the underlying FPGA architecture for better area-time performance. Implementation and experimental results show that our method reduces the output latency by 38.3% compared with the fastest existing design on FPGA.
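To make the idea of a multi-part redundant intermediate result concrete, the following is a minimal sketch of carry-save compression. It uses the textbook 3-to-2 carry-save adder and, as an assumption for illustration only, chains three of them to reduce six operands to three. The function names `csa` and `compress_6_to_3` are hypothetical, and this is not the paper's LUT-based 6-to-3 compressor, whose internal structure is not described in this summary.

```python
def csa(a, b, c):
    """Textbook 3-to-2 carry-save adder on non-negative integers.

    Returns (s, cy) with a + b + c == s + cy. No carry ripples across
    bit positions: each output bit depends on three input bits only.
    """
    s = a ^ b ^ c                            # bitwise sum without carries
    cy = ((a & b) | (a & c) | (b & c)) << 1  # majority bits, shifted up one place
    return s, cy


def compress_6_to_3(operands):
    """Reduce six operands to three with the same total (illustrative only).

    Two CSAs work in parallel on the first level; a third CSA merges
    three of the four resulting words on the second level.
    """
    x0, x1, x2, x3, x4, x5 = operands
    s0, c0 = csa(x0, x1, x2)
    s1, c1 = csa(x3, x4, x5)
    s2, c2 = csa(s0, c0, s1)
    return s2, c2, c1


if __name__ == "__main__":
    ops = [0x1F3, 0x2A7, 0x0FF, 0x3C1, 0x155, 0x2AA]
    z = compress_6_to_3(ops)
    # The three-part result carries the same value as the six inputs.
    print(z, sum(z) == sum(ops))
```

Keeping the running value as three words postpones the full carry-propagating addition, which is the general motivation for redundant representations in high-speed multipliers.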

Paper Structure

This paper contains 22 sections, 3 theorems, 28 equations, 10 figures, 1 table, and 3 algorithms.

Introduction
Preliminary
Montgomery Modular Multiplication
Parallel Compression
Different-Radix MMM variant
Overview of Proposed Variant DRMMM
Cross-iteration Pipeline to Compute $\hat{q}$
Correctness Proof and Latency Analysis
Correctness Proof
Latency Analysis
Hardware Architecture for DRMMM
Overview of Proposed Hardware Architecture
LUT-Based High-Performance Operators on FPGA
LUT-Based Precomputed Encoding Layer
Advanced LUT-Based 6-to-3 Compressor
...and 7 more sections

Key Result

theorem 1

Let $d$ be the number of iterations and $r=2^k$ be the radix of MMM. Given the inputs $A,B,M$, the value $\sum\limits_{j=0}^{d-1} {q_j}r^j\ \text{mod}\ 2^{|M|}$ in iterative MMM is a constant independent of $k$. The constant equals $ABM'\ \text{mod}\ 2^{|M|}$, where $M'=-M^{-1}\ \text{mod}\ 2^{|M|}$.
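The statement can be checked numerically with a small sketch of the baseline radix-$2^k$ iterative MMM (not DRMMM itself). The assumptions here are that $M$ is odd, that $dk=|M|$ so $r^d=2^{|M|}$, and that each quotient digit is computed as $q_i=(Z+a_iB)\,(-M^{-1})\ \text{mod}\ r$. The function names `mmm_quotients` and `combined_quotient` and the test values are hypothetical.

```python
def mmm_quotients(A, B, M, k, n):
    """Baseline radix-2^k iterative Montgomery multiplication.

    n is the bit length |M| and must be a multiple of k (assumption).
    Returns (Z, qs), where Z is congruent to A*B*2^(-n) mod M and
    qs holds the quotient digits q_0 .. q_{d-1}.
    """
    assert M % 2 == 1 and n % k == 0
    r = 1 << k
    d = n // k
    m_prime = (-pow(M, -1, r)) % r      # -M^{-1} mod r (Python 3.8+)
    Z, qs = 0, []
    for i in range(d):
        a_i = (A >> (k * i)) & (r - 1)  # i-th radix-r digit of A
        t = Z + a_i * B
        q = (t * m_prime) % r           # makes t + q*M divisible by r
        Z = (t + q * M) >> k            # exact division by r
        qs.append(q)
    return Z, qs


def combined_quotient(qs, k):
    """Evaluate sum_j q_j * r^j with r = 2^k."""
    return sum(q << (k * j) for j, q in enumerate(qs))


if __name__ == "__main__":
    A, B, M = 0x123456, 0xABCDEF, 0xF123D5   # arbitrary values, M odd
    n = M.bit_length()
    R = 1 << n
    expected = (A * B * ((-pow(M, -1, R)) % R)) % R
    for k in (1, 2, 3, 4, 6, 8, 12, 24):
        _, qs = mmm_quotients(A, B, M, k, n)
        print(k, combined_quotient(qs, k) == expected)
```

The check works because after $d$ iterations $Z\,r^d = AB + M\sum_j q_j r^j$, so the combined quotient is fixed modulo $r^d$ regardless of how it is split into digits.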

Figures (10)

Figure 1: Compressors based on CSA.
Figure 2: In-iteration data dependencies in iterative MMM.
Figure 3: Comparison of $Z$ in each iteration.
Figure 4: Pipelined computation of $\hat{q}$.
Figure 5: Registers in the pipeline.
...and 5 more figures

Theorems & Definitions (7)

definition 1: Cross-iteration dependence degree
theorem 1: Consistency of $q$
proof
theorem 2: shifting validity
proof
theorem 3: output correctness
proof
