Modulo-$(2^{2n}+1)$ Arithmetic via Two Parallel n-bit Residue Channels

Ghassem Jaberipur; Bardia Nadimi; Jeong-A Lee

TL;DR

The paper addresses the speed imbalance that the double-width $(2^{2n}+1)$ channel introduces into residue number systems. It replaces that channel with the two complex-conjugate moduli $(2^n\pm j)$, which preserves the overall DR while keeping every residue channel narrow and balanced. It develops a full LUT-based arithmetic chain for modulo $(2^n\pm j)$, including residue generators, adders, multipliers, and a cost-free reverse converter to modulo $(2^{2n}+1)$. FPGA experiments on a Spartan 7S100 validate the approach: adding the extra modulus raises the DR by about $70\%$, and the moduli set can be sized to cover all 32-bit numbers. The contributions include detailed LUT realizations, a constructive reverse-conversion scheme based on the New CRT, and a performance-focused comparison against traditional moduli sets, demonstrating faster, lower-power arithmetic for potential DNN accelerators. Overall, the work provides a hardware-friendly pathway to high-precision, energy-efficient RNS computation using complex-number moduli, with tangible benefits for CNN/DNN hardware cores and other modular arithmetic workloads.

Abstract

Augmenting the balanced residue number system moduli set $\{m_1=2^n, m_2=2^n-1, m_3=2^n+1\}$ with the co-prime modulus $m_4=2^{2n}+1$ increases the dynamic range (DR) by around 70%. The Mersenne form of the product $m_2 m_3 m_4=2^{4n}-1$ in the moduli set $\{m_1,m_2,m_3,m_4\}$ leads to a very efficient reverse convertor based on the New Chinese remainder theorem. However, the double bit-width of the $m_4$ residue channel is counter-productive and jeopardizes the speed balance in $\{m_1,m_2,m_3\}$. Therefore, we decompose $m_4$ into two complex-number n-bit moduli $2^n\pm\sqrt{-1}$, which preserves the DR and the co-primality across the augmented moduli set. The required forward modulo-$(2^{2n}+1)$ to moduli-$(2^n\pm\sqrt{-1})$ conversion and its reverse are immediate and cost-free. The proposed unified moduli-$(2^n\pm\sqrt{-1})$ adder and multiplier are tested and synthesized on a Spartan 7S100 FPGA. The 6-bit look-up tables (LUTs) therein favor LUT realizations of the adders and multipliers for $n=5$, where the DR equals $2^{25}-2^5$. However, the experiments show that, to cover all 32-bit numbers, the power-of-two channel $m_1$ can be as wide as 12 bits with no harm to the speed balance across the five moduli. The results also show that the moduli-$(2^5\pm\sqrt{-1})$ add and multiply operations are advantageous vs. moduli-$(2^5\pm1)$ in speed, cost, and energy measures, and collectively better than those of modulo-$(2^{10}+1)$.
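The arithmetic behind the abstract can be checked with a short sketch. It is an illustration only, not the authors' LUT design: the function names are hypothetical, and the residue representation (a pair of integers for the real and imaginary parts) is an assumption. It relies on the identity $(2^n+j)(2^n-j)=2^{2n}+1$ and on the congruence $2^n\equiv j \pmod{2^n-j}$, so a value $X=X_H 2^n+X_L$ is congruent to $X_L+jX_H$, which is why the forward and reverse conversions need no computation beyond splitting or concatenating the two halves.

```python
def dynamic_range(n):
    """Product of the moduli {2^n, 2^n - 1, 2^n + 1, 2^(2n) + 1}."""
    return (2**n) * (2**n - 1) * (2**n + 1) * (2**(2 * n) + 1)


def to_complex_residue(x, n, sign=1):
    """Split x (0 <= x <= 2^(2n)) into halves and return (real, imag).

    sign=+1 models the modulus 2^n - j (where 2^n is congruent to  j),
    sign=-1 models the modulus 2^n + j (where 2^n is congruent to -j).
    """
    hi, lo = divmod(x, 2**n)
    return (lo, sign * hi)


def from_complex_residue(res, n, sign=1):
    """Map (real, imag) back to an integer modulo 2^(2n) + 1."""
    re, im = res
    return (re + sign * im * 2**n) % (2**(2 * n) + 1)


def complex_add(a, b):
    """Componentwise addition of two (real, imag) pairs."""
    return (a[0] + b[0], a[1] + b[1])


def complex_mul(a, b):
    """Gaussian-integer product of two (real, imag) pairs.

    The result is left unreduced; a hardware channel would reduce it
    back to n-bit parts, which this sketch does not model.
    """
    return (a[0] * b[0] - a[1] * b[1], a[0] * b[1] + a[1] * b[0])


if __name__ == "__main__":
    n = 5
    M = 2**(2 * n) + 1
    assert dynamic_range(n) == 2**25 - 2**5

    x, y = 700, 913
    for sign in (1, -1):
        rx = to_complex_residue(x, n, sign)
        ry = to_complex_residue(y, n, sign)
        assert from_complex_residue(complex_add(rx, ry), n, sign) == (x + y) % M
        assert from_complex_residue(complex_mul(rx, ry), n, sign) == (x * y) % M
    print("checks passed")
```

Running the script confirms, for $n=5$, that the DR equals $2^{25}-2^5$ and that addition and multiplication performed on the complex pairs agree with ordinary modulo-$(2^{10}+1)$ arithmetic in both conjugate channels.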

Paper Structure (10 sections, 12 equations, 8 figures, 8 tables)

Introduction
A Background on General RNS
Modulo-$(2^n\pm j)$ Arithmetic
Modulo-$(2^n\pm j)$ residue generator
Modulo-$(2^n\pm j)$ adders
LUT realization of the modulo-$(2^n\mp j)$ adders
Modulo-$(2^n\pm j)$ multipliers
LUT realization of $|X\times Y|_{2^n\mp j}$
Immediate and cost-free Reverse moduli-$(2^n\pm j)$ to modulo-$(2^{2n}+1)$ convertor
Evaluation And Comparison

Figures (8)

Figure 1: Proposed complex-number modulo Adder diagram.
Figure 2: Proposed complex-number modulo Multiplier diagram.
Figure 3: Complex-number Modulo vs modulo-$(2^{2n}+1)$ adder: Delay comparison.
Figure 4: Complex-number Modulo vs modulo-$(2^{2n}+1)$ adder: Power comparison.
Figure 5: Complex-number Modulo vs modulo-$(2^{2n}+1)$ adder: Area comparison.
...and 3 more figures
