GPU Implementations for Midsize Integer Addition and Multiplication

Cosmin E. Oancea; Stephen M. Watt

TL;DR

This work demonstrates that big-integer arithmetic for midsize values can be implemented effectively on GPUs using high-level languages, without resorting to low-level device-specific tricks. It compares simple, scalable approaches for addition and for both quadratic and FFT-based multiplication, showing that FFT-based multiplication can outperform the classical method on the larger inputs that still fit within a single CUDA block. The paper provides detailed CUDA and Futhark implementations and evaluates them against Nvidia's CGBN library, finding speedups for integers larger than 32K bits and documenting Futhark compiler bottlenecks related to sequentialization and memory mapping. Overall, the study highlights practical pathways for integrating multi-precision arithmetic into parallel GPU programs and identifies concrete compiler improvements to close performance gaps. The results have implications for cryptography and computer algebra applications where midsize integers are common, suggesting that high-level GPU programming models can yield competitive performance with careful design.

Abstract

This paper explores practical aspects of using a high-level functional language for GPU-based arithmetic on "midsize" integers. By this we mean integers of up to about a quarter million bits, which is sufficient for most practical purposes. The goal is to understand whether it is possible to support efficient nested-parallel programs with a small, flexible code base. We report on GPU implementations for addition and multiplication of integers that fit in one CUDA block, thus leveraging temporal reuse from scratchpad memories. Our key contribution resides in the simplicity of the proposed solutions: We recognize that addition is a straightforward application of scan, which is known to allow efficient GPU implementation. For quadratic multiplication we employ a simple work-partitioning strategy that offers good temporal locality. For FFT multiplication, we efficiently map the computation to the domain of integer prime fields by finding "good" primes that enable almost-full utilization of machine words. In comparison, related work uses complex tiling strategies (which feel too big a hammer for the job) or uses the computational domain of reals, which may reduce the size of the base in which the computation is carried out. We evaluate the performance in comparison to the state-of-the-art CGBN library, authored by NvidiaLab, and report that our CUDA prototype outperforms CGBN for integer sizes larger than 32K bits, while offering comparable performance for smaller sizes. Moreover, we are, to our knowledge, the first to report that FFT multiplication outperforms the classical one on the larger sizes that still fit in a CUDA block. Finally, we examine Futhark's strengths and weaknesses for efficiently supporting such computations and find that a compiler pass aimed at efficient sequentialization of excess parallelism would significantly improve performance.
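The observation that addition reduces to a scan can be illustrated with a small sequential model. The sketch below is not the paper's CUDA or Futhark code: the function names `carry_op` and `badd_scan`, the little-endian limb layout, and the choice to drop the final carry are assumptions made for illustration. Each limb position yields a (generate, propagate) pair, an associative operator combines the pairs, and an inclusive scan over them gives the carry out of every position. On a GPU that scan would run in logarithmic depth; here `itertools.accumulate` stands in for it.

```python
from itertools import accumulate

BASE = 1 << 32  # assumed limb size: 32-bit unsigned words


def carry_op(left, right):
    """Associative operator on (generate, propagate) pairs; left is less significant."""
    g1, p1 = left
    g2, p2 = right
    # A carry leaves the combined span if the right part generates one,
    # or if it propagates a carry generated by the left part.
    return (g2 or (p2 and g1), p1 and p2)


def badd_scan(a, b):
    """Add two equal-length little-endian limb arrays modulo BASE**len(a)."""
    sums = [x + y for x, y in zip(a, b)]            # per-limb sums, no carries yet
    flags = [(s >= BASE, s == BASE - 1) for s in sums]
    scanned = list(accumulate(flags, carry_op))     # inclusive scan: carry out of limb i
    carry_in = [False] + [g for g, _ in scanned[:-1]]
    return [(s + int(c)) % BASE for s, c in zip(sums, carry_in)]


# A carry generated in limb 0 ripples through the all-ones limb 1.
print(badd_scan([0xFFFFFFFF, 0xFFFFFFFF, 0], [1, 0, 0]))  # → [0, 0, 1]
```

Because `carry_op` is associative, the same pairs could be combined in any bracketing, which is what makes a parallel scan applicable.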

Paper Structure (25 sections, 8 equations, 10 figures, 2 tables)

Introduction
Related Work
Contributions
Outline
Integer Addition
Classical, Quadratic Multiplication
Key Insights
Tiling Approach
Load Balanced Partitioning of the Result Across Threads
Birds-Eye View of Implementation
Implementing The Convolution
FFT-Based Integer Multiplication
Construction of Integer Prime Fields for DFFT
Straightforward Acceleration of Cooley-Tukey Algorithm
Futhark's Strengths and Weaknesses
...and 10 more sections

Figures (10)

Figure 1: Sequential and parallel procedures for addition $a+b$ in base $2^8$: linear vs. logarithmic time.
Figure 2: Futhark pseudocode for performing a batch of IPB additions of big numbers, each represented as an array of $M$ 32-bit unsigned integers. Function badd is supposed to be mapped at CUDA-block level (where arrays are mapped to scratchpad memory). Efficient sequentialization is not shown, although it is critical for good performance.
Figure 3: Sketch of a simple CUDA kernel for the tiled version of quadratic multiplication.
Figure 4: A load-balanced, embarrassingly parallel partitioning assigns thread $0$ to compute $C_0$ and $C_{M-1}$, thread $1$ to compute $C_1$ and $C_{M-2}$, thread $2$ to compute $C_2$ and $C_{M-3}$, and so on. All threads perform a total of $M$ fused multiply-add operations.
Figure 5: Main CUDA wrapper function that computes quadratic integer multiplication.
...and 5 more figures
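The pairing described in the Figure 4 caption can be checked with a small sequential model. This is a sketch under stated assumptions, not the paper's CUDA kernel: the name `mul_low_balanced`, the little-endian digit layout, an even digit count $M$, and truncation of the product to its low $M$ digits are all choices made here for illustration. Each simulated thread $t$ computes the convolution coefficients $C_t$ and $C_{M-1-t}$, and the function records how many multiply-add steps each thread performs so the balance can be inspected.

```python
def mul_low_balanced(a, b, base):
    """Low M digits of a*b, with coefficients paired as (C_t, C_{M-1-t}) per thread."""
    M = len(a)
    assert len(b) == M and M % 2 == 0
    C = [0] * M
    work = []                          # multiply-add steps per simulated thread
    for t in range(M // 2):            # each iteration models one GPU thread
        steps = 0
        for k in (t, M - 1 - t):       # one short and one long coefficient
            acc = 0
            for i in range(k + 1):
                acc += a[i] * b[k - i]
                steps += 1
            C[k] = acc
        work.append(steps)
    # Sequential carry propagation turns coefficients into digits;
    # the carry out of the top digit is dropped (result is modulo base**M).
    digits, carry = [], 0
    for c in C:
        carry, d = divmod(c + carry, base)
        digits.append(d)
    return digits, work


digits, work = mul_low_balanced([255, 255, 1, 0], [255, 2, 0, 0], 256)
print(work)  # every entry is the same, so the simulated threads are balanced
```

Pairing a short coefficient with a long one is what equalizes the per-thread work; assigning coefficient $C_k$ alone to thread $k$ would leave thread $M-1$ with $M$ times the work of thread $0$.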
