Accelerating Elliptic Curve Point Additions on Versal AI Engine for Multi-scalar Multiplication
Ayumi Ohno, Kotaro Shimamura, Shinya Takamaeda-Yamazaki
TL;DR
This paper accelerates multi-scalar multiplication (MSM) on Versal ACAP by designing an AIE-centric PADD unit. It introduces a mixed PADD approach that represents 377-bit operands as 16 limbs of 25 bits and uses Barrett reduction, with multiplications and carry propagation performed primarily on the AIEs; four spatial mappings and multiple parallelism levels are evaluated to maximize throughput and minimize latency. Experiments on real hardware and in cycle-accurate simulation show that the design achieves a throughput of 67.0 M tasks/s and a latency of 1.05 µs per task, a 568× speedup over the integrated CPU, though it still trails FPGA/ASIC-based MSM approaches such as BSTMSM. The results underscore the benefit of performing carry propagation on the AIE, the potential of Montgomery representations for further reductions, and the tradeoffs between tile mappings, informing future heterogeneous-architecture MSM accelerators, with practical implications for zk-SNARK workloads.
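The limb representation and Barrett reduction mentioned above can be illustrated with a minimal Python sketch. It is not the paper's AIE implementation: the function names are hypothetical, the modulus `P` is an arbitrary 377-bit stand-in rather than the curve's actual modulus, and the single deferred carry pass only conveys the general idea of accumulating column sums before propagating carries.

```python
LIMB_BITS = 25                      # bits per limb (from the text)
NUM_LIMBS = 16                      # 16 * 25 = 400 bits >= 377 bits
MASK = (1 << LIMB_BITS) - 1

# Assumption: arbitrary 377-bit stand-in modulus, NOT the real curve modulus.
P = (1 << 376) + 12345


def to_limbs(x, n=NUM_LIMBS):
    """Split a non-negative integer into n little-endian 25-bit limbs."""
    return [(x >> (LIMB_BITS * i)) & MASK for i in range(n)]


def from_limbs(limbs):
    """Recombine little-endian limbs into an integer."""
    return sum(v << (LIMB_BITS * i) for i, v in enumerate(limbs))


def mul_deferred_carry(a, b):
    """Schoolbook limb product: accumulate column sums first, carry once."""
    cols = [0] * (2 * NUM_LIMBS)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            cols[i + j] += ai * bj   # multiply-accumulate, no carries yet
    out, carry = [], 0
    for c in cols:                   # one carry-propagation pass at the end
        c += carry
        out.append(c & MASK)
        carry = c >> LIMB_BITS
    assert carry == 0                # product of two 400-bit values fits
    return out


def barrett_setup(p):
    """Precompute k = bit length of p and mu = floor(4^k / p)."""
    k = p.bit_length()
    return k, (1 << (2 * k)) // p


def barrett_reduce(x, p, k, mu):
    """Reduce 0 <= x < 4^k modulo p using a multiply and shift."""
    q = (x * mu) >> (2 * k)          # estimate of floor(x / p), never too big
    r = x - q * p
    while r >= p:                    # small number of correction steps
        r -= p
    return r


def mulmod(a, b, p=P):
    """Modular product via limb multiplication and Barrett reduction."""
    k, mu = barrett_setup(p)
    prod = from_limbs(mul_deferred_carry(to_limbs(a), to_limbs(b)))
    return barrett_reduce(prod, p, k, mu)
```

With 25-bit limbs, each column sum in this sketch stays below 16 · 2^50 = 2^54, so the accumulation would also fit in a 64-bit accumulator before the carry pass.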
Abstract
Multi-scalar multiplication (MSM) is crucial in cryptographic applications and is a computationally intensive component of zero-knowledge proofs. MSM involves accumulating the products of scalars and points on an elliptic curve over a 377-bit modulus, and the Pippenger algorithm converts MSM into a series of elliptic curve point additions (PADDs) with high parallelism. This study investigates accelerating MSM on the Versal ACAP platform, an emerging hardware platform with a spatial architecture that integrates 400 AI Engines (AIEs) with programmable logic and a processing system. AIEs are SIMD-based VLIW processors capable of performing vector multiply-accumulate operations, making them well-suited for the multiplication-heavy workloads in PADD. Unlike the simpler multiplication tasks in previous studies, cryptographic computations also require complex operations such as carry propagation. These operations necessitate architecture-aware optimizations, including a dedicated intra-core coding style to fully exploit VLIW capabilities and an inter-core strategy for spatial task mapping. We propose various optimizations to accelerate PADDs, including (1) algorithmic optimizations for carry propagation that employ a carry-save-like technique to exploit VLIW and SIMD capabilities and (2) a comparison of four distinct spatial mappings to enhance intra- and inter-task parallelism. Our approach achieves a computational efficiency that utilizes 50.2% of the theoretical memory bandwidth and provides a 568× speedup over the integrated CPU on the AIE evaluation board.
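To make concrete how the Pippenger algorithm reduces MSM to a stream of additions, here is a minimal Python sketch of the bucket method. It is illustrative only: the group is a toy stand-in (integers under addition modulo a hypothetical `N`) rather than an elliptic curve, and the function name, `window` parameter, and demo inputs are assumptions, not taken from the paper. Each call to `add` plays the role of one PADD.

```python
def msm_pippenger(scalars, points, add, zero, scalar_bits, window=4):
    """Compute sum(k_i * P_i) using only the group operation `add`."""
    num_windows = (scalar_bits + window - 1) // window
    result = zero
    for w in reversed(range(num_windows)):
        # Shift the accumulated result left by `window` bits via doublings.
        for _ in range(window):
            result = add(result, result)
        # Sort points into buckets by the current window of each scalar.
        buckets = [zero] * (1 << window)
        for k, pt in zip(scalars, points):
            idx = (k >> (w * window)) & ((1 << window) - 1)
            if idx:
                buckets[idx] = add(buckets[idx], pt)
        # Running-sum trick: computes sum_j j * buckets[j] with additions only.
        running, window_sum = zero, zero
        for j in range((1 << window) - 1, 0, -1):
            running = add(running, buckets[j])
            window_sum = add(window_sum, running)
        result = add(result, window_sum)
    return result


# Toy group (assumption): integers modulo N under addition.
N = 1_000_003


def toy_add(a, b):
    return (a + b) % N


demo_scalars = [5, 300, 77]
demo_points = [11, 22, 33]
demo_result = msm_pippenger(demo_scalars, demo_points, toy_add, 0, scalar_bits=9)
demo_expected = sum(k * p for k, p in zip(demo_scalars, demo_points)) % N
```

The bucket-filling loop contains no dependencies between different buckets, which is the kind of parallelism across additions that the text refers to.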
