A Nonlinear Hash-based Optimization Method for SpMV on GPUs

Chen Yan, Boyu Diao, Hangda Liu, Zhulin An, Yongjun Xu

TL;DR

This work tackles the SpMV bottleneck on GPUs for large sparse matrices by introducing a Hash-based Partition (HBP) format that combines nonlinear hashing with 2D partitioning to enable warp-level parallelism and reduce preprocessing overhead. The method comprises three components: a six-array HBP storage layout, a lightweight nonlinear hash-based reordering, and a Mixed Execution Allocation that balances load across matrix blocks and warps using fixed and competitive execution parts. In preprocessing, the method is on average 3.53x faster than the sorting approach and 3.67x faster than the dynamic programming method used in Regu2D. In SpMV, it reaches speedups of up to 3.32x on Jetson AGX Orin and 3.01x on RTX 4090 over the CSR baseline, with larger gains against 2D-partitioning. The results indicate better parallel load balancing, lower preprocessing cost, and higher SpMV throughput for large-scale sparse computations on modern GPUs.
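
The reordering idea can be illustrated with a minimal sketch. The summary does not give the actual hash function, so `nnz_bucket` below is a hypothetical stand-in: it maps a row's nonzero count to a logarithmic bucket, which is one simple nonlinear mapping that groups rows of similar length without a comparison sort. The function names and the CSR-style `row_ptr` input are assumptions made for illustration, not the paper's interface.

```python
# Hypothetical sketch: group rows by a nonlinear (logarithmic) hash of their
# nonzero count. The hash below is an assumption, not the paper's function.

def nnz_bucket(nnz: int) -> int:
    """Map a nonzero count to a bucket: 0 -> 0, 1 -> 1, 2-3 -> 2, 4-7 -> 3, ..."""
    return nnz.bit_length()


def hash_reorder(row_ptr):
    """Return row indices grouped by bucket, lightest buckets first.

    row_ptr is a CSR row-pointer array, so row i holds
    row_ptr[i + 1] - row_ptr[i] nonzeros. Only linear passes over the rows
    are needed; no comparison sort is performed.
    """
    n_rows = len(row_ptr) - 1
    lengths = [row_ptr[i + 1] - row_ptr[i] for i in range(n_rows)]
    buckets = [[] for _ in range(nnz_bucket(max(lengths, default=0)) + 1)]
    for i, nnz in enumerate(lengths):
        buckets[nnz_bucket(nnz)].append(i)
    # Concatenate buckets so rows with few nonzeros come first.
    return [i for bucket in buckets for i in bucket]


row_ptr = [0, 5, 6, 6, 9, 10, 18]  # row lengths: 5, 1, 0, 3, 1, 8
order = hash_reorder(row_ptr)
print(order)  # [2, 1, 4, 3, 0, 5]
```

Rows inside a bucket keep their original relative order, so the grouping is coarser than a full sort; that trade of exact ordering for a cheaper pass is the general idea the sketch is meant to convey.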

Abstract

Sparse matrix-vector multiplication (SpMV) is a fundamental operation with a wide range of applications in scientific computing and artificial intelligence. However, the large scale and sparsity of the matrices involved often make it a performance bottleneck. In this paper, we highlight the effectiveness of hash-based techniques in optimizing sparse matrix reordering, introducing the Hash-based Partition (HBP) format, a lightweight SpMV approach. HBP retains the performance benefits of the 2D-partitioning method while leveraging the hash transformation's ability to group similar elements, thereby accelerating the pre-processing phase of sparse matrix reordering. Additionally, we achieve parallel load balancing across matrix blocks through a competitive method. Our experiments, conducted on both Nvidia Jetson AGX Orin and Nvidia RTX 4090, show that in the pre-processing step, our method offers an average speedup of 3.53 times compared to the sorting approach and 3.67 times compared to the dynamic programming method employed in Regu2D. Furthermore, in SpMV, our method achieves a maximum speedup of 3.32 times on Orin and 3.01 times on RTX 4090 over the CSR format on sparse matrices from the University of Florida Sparse Matrix Collection.
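
To make the CSR baseline and the 2D-partitioning idea concrete, here is a minimal CPU sketch in plain Python. It is not the HBP kernel: the block layout, the function names (`spmv_csr`, `partition_2d`, `spmv_two_step`), and the block sizes are assumptions chosen for illustration. Each block is multiplied by its segment of the input vector, and the partial vectors of blocks in the same block row are then merged.

```python
from collections import defaultdict


def spmv_csr(row_ptr, col, data, x):
    """Reference y = A @ x for a matrix stored in CSR."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += data[k] * x[col[k]]
    return y


def partition_2d(row_ptr, col, data, block_rows, block_cols):
    """Split the nonzeros into 2D blocks keyed by (block row, block column)."""
    blocks = defaultdict(list)
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            key = (i // block_rows, col[k] // block_cols)
            # Store coordinates local to the block.
            blocks[key].append((i % block_rows, col[k] % block_cols, data[k]))
    return blocks


def spmv_two_step(blocks, x, n_rows, block_rows, block_cols):
    # Step 1: multiply each block by its vector segment.
    partials = {}
    for (rb, cb), entries in blocks.items():
        seg = x[cb * block_cols:(cb + 1) * block_cols]
        part = [0.0] * block_rows
        for r, c, v in entries:
            part[r] += v * seg[c]
        partials[(rb, cb)] = part
    # Step 2: merge the partial vectors that share a block row.
    y = [0.0] * n_rows
    for (rb, _), part in partials.items():
        for r, v in enumerate(part):
            i = rb * block_rows + r
            if i < n_rows:
                y[i] += v
    return y


# 4x4 example: rows [1 0 2 0], [0 3 0 0], [0 0 0 4], [5 0 6 0]
row_ptr = [0, 2, 3, 4, 6]
col = [0, 2, 1, 3, 0, 2]
data = [1, 2, 3, 4, 5, 6]
x = [1, 2, 3, 4]

blocks = partition_2d(row_ptr, col, data, block_rows=2, block_cols=2)
print(spmv_two_step(blocks, x, 4, 2, 2))  # [7.0, 6.0, 16.0, 23.0]
print(spmv_csr(row_ptr, col, data, x))    # [7.0, 6.0, 16.0, 23.0]
```

On a GPU, step 1 would typically map one block to one warp so that each warp touches only a short segment of `x`; the sketch keeps everything sequential to show the data flow only.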

Paper Structure

This paper contains 10 sections, 13 figures, 2 tables, 3 algorithms.

Figures (13)

  • Figure 1: Two-step SpMV. In the SpMV part, the matrix blocks are multiplied by their corresponding vector segments to obtain a set of vectors. In the combine part, the vectors in the same row are merged to obtain the final result.
  • Figure 2: An example of the HBP format. We assume that a warp consists of 4 threads. $col$ and $data$ are stored in the order produced by the hash transformation. $zero\_row$ and $begin\_ptr$ are used to find the first element that the current thread needs to calculate. $add\_sign$ represents the distance from the current element to the next element within the same row.
  • Figure 3: The format of nonlinear hash.
  • Figure 4: An example of aggregation by hashing. Rows with fewer nonzero elements are aggregated after the nonlinear hash mapping and are computed first by the threads of a warp.
  • Figure 5: An example of SpMV with mixed execution. Each matrix block is calculated by a warp. After a warp finishes computing its fixed parts, it selects uncomputed matrix blocks from the competitive parts and performs SpMV on them.
  • ...and 8 more figures
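
The mixed execution described in the Figure 5 caption can be sketched on the CPU with threads standing in for warps. This is an illustration under stated assumptions, not the paper's kernel: the function name `mixed_execution`, the `fixed_per_warp` parameter, the use of a lock-protected cursor (a GPU version would presumably use an atomic counter), and the dummy per-block workload are all hypothetical.

```python
import threading


def mixed_execution(block_costs, n_warps, fixed_per_warp):
    """Return, for each simulated warp, the ids of the blocks it processed.

    The first n_warps * fixed_per_warp blocks form the fixed part and are
    assigned statically; the remaining blocks form the competitive part and
    are claimed one at a time by whichever warp becomes idle first.
    """
    n_blocks = len(block_costs)
    n_fixed = min(n_warps * fixed_per_warp, n_blocks)
    next_block = [n_fixed]           # shared cursor into the competitive part
    lock = threading.Lock()          # stand-in for an atomic increment
    done = [[] for _ in range(n_warps)]

    def process(block_id):
        # Dummy workload standing in for the SpMV of one matrix block.
        return sum(range(block_costs[block_id]))

    def warp(w):
        # Fixed part: blocks assigned to this warp ahead of time.
        start = w * fixed_per_warp
        stop = min(start + fixed_per_warp, n_fixed)
        for b in range(start, stop):
            process(b)
            done[w].append(b)
        # Competitive part: claim the next uncomputed block until none remain.
        while True:
            with lock:
                b = next_block[0]
                if b >= n_blocks:
                    return
                next_block[0] += 1
            process(b)
            done[w].append(b)

    threads = [threading.Thread(target=warp, args=(w,)) for w in range(n_warps)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


costs = [3, 50, 2, 7, 1, 9, 4, 8, 6, 5]
assignment = mixed_execution(costs, n_warps=3, fixed_per_warp=2)
for w, blocks in enumerate(assignment):
    print(f"warp {w}: fixed {blocks[:2]}, competitive {blocks[2:]}")
```

Which warp picks up which competitive block depends on timing and can differ between runs; the invariant is that every block is processed exactly once and that each warp finishes its fixed blocks before competing.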