gECC: A GPU-based high-throughput framework for Elliptic Curve Cryptography
Qian Xiong, Weiliang Ma, Xuanhua Shi, Yongluan Zhou, Hai Jin, Kaiyi Huang, Haozhou Wang, Zhengru Wang
TL;DR
The paper addresses the ECC performance bottleneck in throughput-sensitive applications by proposing gECC, a GPU-optimized framework that batches EC operations using Montgomery's trick and uses GAS for batch modular inversion. It combines data-locality-aware kernel fusion with multi-level cache management to reduce memory overhead, and it introduces SM2-specific modular reduction optimizations that lower the number of IMAD instructions. On an NVIDIA A100, gECC achieves speedups of up to 5.56x for ECDSA verification and 4.94x for ECDH over state-of-the-art GPU systems, along with gains in batch point multiplication (PMUL) and modular multiplication, and a 1.56x throughput improvement in a real blockchain workload over the state-of-the-art CPU-based system. The work targets high-throughput cryptographic services in blockchain, verifiable databases, and secure cloud services, and the gECC framework is available as open source.
Abstract
Elliptic Curve Cryptography (ECC) is a public-key cryptographic method that provides security comparable to traditional techniques such as Rivest-Shamir-Adleman (RSA) but with lower computational complexity and smaller key sizes, making it a competitive option for applications such as blockchain, secure multi-party computation, and database security. However, the throughput of ECC is still limited by the significant performance overhead of elliptic curve (EC) operations. This paper presents gECC, a versatile ECC framework optimized for GPU architectures and engineered for high-throughput EC operations. gECC combines batch-based execution of EC operations with microarchitecture-level optimization of modular arithmetic. It employs Montgomery's trick to enable batch EC computation and introduces novel computation parallelization and memory management techniques that maximize parallelism and minimize the overhead of accessing GPU global memory. We also analyze the primary bottleneck in modular multiplication by investigating how the user code for modular multiplication is compiled into hardware instructions and what the issue rates of these instructions are. We find that the efficiency of modular multiplication depends heavily on the number of Integer Multiply-Add (IMAD) instructions. To eliminate this bottleneck, we propose techniques that minimize the number of IMAD instructions by using predicate registers to pass carry information and by replacing IMAD instructions with addition and subtraction instructions (IADD3). Our results show that, for ECDSA and ECDH, gECC achieves performance improvements of 5.56x and 4.94x, respectively, over the state-of-the-art GPU-based system. In a real-world blockchain application, gECC achieves a 1.56x performance improvement over the state-of-the-art CPU-based system.
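
To make the batching idea concrete, the following is a minimal CPU-side sketch of Montgomery's trick, which replaces n modular inversions with a single inversion plus roughly 3n modular multiplications. It is an illustration only and not the gECC GPU kernel: the function name `batch_inverse` and the secp256k1 field prime used as the default modulus are assumptions chosen for the example, and Python 3.8+ is assumed for `pow(x, -1, p)`.

```python
# Illustrative sketch of Montgomery's trick (batch modular inversion).
# Assumption: `batch_inverse` and the default modulus are hypothetical choices
# for this example; they are not taken from the gECC code base.

SECP256K1_P = 2**256 - 2**32 - 977  # example prime field modulus


def batch_inverse(values, p=SECP256K1_P):
    """Return the inverse of every element of `values` modulo the prime `p`,
    using exactly one modular inversion."""
    n = len(values)
    if n == 0:
        return []

    # Forward pass: prefix[i] = values[0] * ... * values[i] (mod p).
    prefix = [0] * n
    acc = 1
    for i, v in enumerate(values):
        if v % p == 0:
            raise ZeroDivisionError("zero has no modular inverse")
        acc = acc * v % p
        prefix[i] = acc

    # The single expensive step: invert the product of all inputs.
    inv_acc = pow(acc, -1, p)  # requires Python 3.8+

    # Backward pass: peel off one factor at a time.
    out = [0] * n
    for i in range(n - 1, 0, -1):
        out[i] = inv_acc * prefix[i - 1] % p  # inverse of values[i]
        inv_acc = inv_acc * values[i] % p     # now inverse of prefix[i - 1]
    out[0] = inv_acc
    return out


if __name__ == "__main__":
    xs = [3, 5, 7, 11]
    invs = batch_inverse(xs)
    print(all(x * y % SECP256K1_P == 1 for x, y in zip(xs, invs)))
```

In EC point addition in affine coordinates, each addition needs one field inversion, so sharing a single inversion across a batch of independent additions is what makes batched execution attractive; how gECC parallelizes these passes on the GPU is described in the paper itself.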
