A Nonlinear Hash-based Optimization Method for SpMV on GPUs
Chen Yan, Boyu Diao, Hangda Liu, Zhulin An, Yongjun Xu
TL;DR
This work tackles the SpMV bottleneck on GPUs for large sparse matrices by introducing the Hash-based Partition (HBP) format, which combines nonlinear hashing with 2D partitioning to enable warp-level parallelism and reduce preprocessing overhead. The method has three components: a six-array HBP storage layout, a lightweight nonlinear hash-based reordering, and a Mixed Execution Allocation that balances load across matrix blocks and warps using fixed and competitive execution parts. In preprocessing, the method reports average speedups of 3.53x over the sorting approach and 3.67x over the dynamic programming method used in Regu2D. In SpMV, it reports maximum speedups over the CSR baseline of 3.32x on Jetson AGX Orin and 3.01x on RTX 4090, plus larger gains against 2D-partitioning. The results indicate improved parallel load balancing, lower preprocessing cost, and higher SpMV throughput for large-scale sparse computations on modern GPUs.
Abstract
Sparse matrix-vector multiplication (SpMV) is a fundamental operation with a wide range of applications in scientific computing and artificial intelligence. However, the large scale and sparsity of the matrices involved often make SpMV a performance bottleneck. In this paper, we highlight the effectiveness of hash-based techniques in optimizing sparse matrix reordering and introduce the Hash-based Partition (HBP) format, a lightweight SpMV approach. HBP retains the performance benefits of the 2D-partitioning method while leveraging the hash transformation's ability to group similar elements, thereby accelerating the pre-processing phase of sparse matrix reordering. Additionally, we achieve parallel load balancing across matrix blocks through a competitive method. Our experiments, conducted on both Nvidia Jetson AGX Orin and Nvidia RTX 4090, show that in the pre-processing step, our method offers an average speedup of 3.53 times compared to the sorting approach and 3.67 times compared to the dynamic programming method employed in Regu2D. Furthermore, in SpMV, our method achieves a maximum speedup of 3.32 times on Orin and 3.01 times on RTX 4090 over the CSR format on sparse matrices from the University of Florida Sparse Matrix Collection.
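
For readers unfamiliar with the CSR baseline mentioned above, the following is a minimal sequential sketch of CSR-format SpMV in plain Python. It is not the HBP format or the authors' GPU kernel; the function names `csr_spmv` and `row_nnz` and the small example matrix are assumptions chosen for illustration. The per-row nonzero counts show the kind of uneven row lengths that motivate reordering and load balancing.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i inside
    col_idx (column indices) and values (nonzero entries).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Walk only the stored nonzeros of this row.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def row_nnz(row_ptr):
    """Number of stored nonzeros in each row (illustrative helper)."""
    return [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]


# Illustrative 4x4 matrix (not from the paper):
# [[10, 0, 0, 2],
#  [ 0, 3, 0, 0],
#  [ 0, 0, 0, 0],
#  [ 1, 0, 4, 5]]
row_ptr = [0, 2, 3, 3, 6]
col_idx = [0, 3, 1, 0, 2, 3]
values = [10, 2, 3, 1, 4, 5]
x = [1, 2, 3, 4]

y = csr_spmv(row_ptr, col_idx, values, x)
print(y)                 # [18.0, 6.0, 0.0, 33.0]
print(row_nnz(row_ptr))  # [2, 1, 0, 3]
```

In a naive one-thread-per-row GPU mapping, the rows above would take 2, 1, 0, and 3 multiply-add steps respectively, so threads in the same warp would finish at different times. Grouping rows of similar length, as the abstract describes, is one way to reduce that imbalance.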
