Optimizing Structured-Sparse Matrix Multiplication in RISC-V Vector Processors

Vasileios Titopoulos, Kosmas Alexandridis, Christodoulos Peltekis, Chrysostomos Nicopoulos, Giorgos Dimitrakopoulos

TL;DR

This work accelerates structured-sparse matrix multiplication on RISC-V vector processors by reorganizing data placement and introducing a new vector instruction. It combines a Gustavson-style row-wise SpMM approach with mixed data placement across the vector and scalar register files, and introduces vindexmac, an instruction that performs indirect reads from the vector register file and reduces memory traffic. Experiments on a decoupled vector unit with pruned CNN workloads show runtime improvements of 25-33% over kernels that use only the current ISA, and memory-traffic reductions of about 42-63% for sparsity patterns such as $1:4$ and $2:4$, with gains that scale with the vector length $VL$. These results demonstrate practical gains for energy-efficient edge ML acceleration, offering a low-overhead path to exploiting structured sparsity in modern vector architectures.

Abstract

Structured sparsity has been proposed as an efficient way to prune the complexity of Machine Learning (ML) applications and to simplify the handling of sparse data in hardware. Accelerating ML models, whether for training or inference, relies heavily on matrix multiplications that can be efficiently executed on vector processors or custom matrix engines. This work aims to integrate the simplicity of structured sparsity into vector execution to speed up the corresponding matrix multiplications. Initially, the implementation of structured-sparse matrix multiplication using the current RISC-V instruction set vector extension is comprehensively explored. Critical parameters that affect performance, such as the distribution of data across the scalar and vector register files, data locality, and the effectiveness of loop unrolling, are analyzed both qualitatively and quantitatively. Furthermore, it is demonstrated that the addition of a single new instruction would yield even higher performance. The newly proposed instruction is called vindexmac, i.e., vector index-multiply-accumulate. It allows for indirect reads from the vector register file and reduces the number of instructions executed per matrix multiplication iteration, without introducing additional dependencies that would limit loop unrolling. The proposed new instruction was integrated in a decoupled RISC-V vector processor with negligible hardware cost. Experimental results demonstrate the runtime efficiency and the scalability offered by the introduced optimizations and the new instruction for the execution of state-of-the-art Convolutional Neural Networks. More particularly, the addition of the custom instruction improves runtime by 25% and 33% when compared with highly-optimized vectorized kernels that use only the currently defined RISC-V instructions.
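
As a rough illustration of the row-wise (Gustavson-style) multiplication that the summary refers to, the following Python sketch multiplies an N:M structured-sparse matrix $A$ by a dense matrix $B$. It is not the authors' kernel: the function names `compress_n_m` and `rowwise_spmm`, the storage of each non-zero as a (value, index-within-block) pair, and the recovery of the column as `block_id * m + idx` are assumptions made for illustration. The innermost loop over the columns of $B$ is the part that a vector processor would execute as a single vector multiply-accumulate.

```python
# Minimal sketch under stated assumptions; helper names are hypothetical.

def compress_n_m(row, m):
    """Split a dense row into blocks of m elements and keep only the
    non-zeros of each block as (value, index-within-block) pairs."""
    blocks = []
    for start in range(0, len(row), m):
        block = row[start:start + m]
        blocks.append([(v, i) for i, v in enumerate(block) if v != 0])
    return blocks


def rowwise_spmm(a_rows, b, m):
    """Row-wise product C = A x B, where each row of A is first
    compressed block by block (assumed N:M layout)."""
    n_cols = len(b[0])
    c = []
    for row in a_rows:
        acc = [0] * n_cols  # plays the role of the output row C[i, :]
        for block_id, block in enumerate(compress_n_m(row, m)):
            for value, idx in block:
                col = block_id * m + idx  # assumed column recovery rule
                b_row = b[col]
                # On a vector unit this loop is one multiply-accumulate.
                for j in range(n_cols):
                    acc[j] += value * b_row[j]
        c.append(acc)
    return c


# A has at most 1 non-zero in every 4 consecutive elements (1:4 pattern).
A = [[0, 3, 0, 0, 5, 0, 0, 0],
     [2, 0, 0, 0, 0, 0, 0, 7]]
B = [[10 * r + c for c in range(3)] for r in range(8)]
C = rowwise_spmm(A, B, 4)
```

Each non-zero of $A$ selects one row of $B$, so the amount of work depends on the number of non-zeros rather than on the full width of $A$.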

Paper Structure

This paper contains 18 sections, 2 equations, 17 figures, 1 table, and 3 algorithms.

Figures (17)

  • Figure 1: Example of (a) unstructured sparsity; and (b) structured block sparsity of 2:4 (i.e., up to 2 non-zero elements in every 4 consecutive elements) and their respective representation. Blue squares represent non-zero elements.
  • Figure 2: Row-wise matrix multiplication to compute output row $C[0,:]$.
  • Figure 3: An example of recovering the actual column index of an element through the column index that is stored in the memory and the block_id.
  • Figure 4: The operation of the proposed vindexmac instruction. The contents of the scalar register are used to address a specific vector register. The vector read is multiplied with the least significant element of another vector register that is read in parallel. The result of the multiplication is accumulated with the previous contents of the vector destination register.
  • Figure 5: The order of execution of Alg. 6 across vertical segments of $A$ and $B$.
  • ...and 12 more figures
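
The behavior described in the caption of Figure 4 can be emulated in a few lines of Python. This is only a functional sketch: the register file is modeled as a list of lists, and the operand names (`vd`, `vs2`, `rs_value`), their order, and the direct use of the scalar value as a register number are assumptions, since the instruction encoding is not given in this summary. The example also assumes that rows of $B$ have been preloaded into vector registers and that a non-zero value of $A$ sits in element 0 of another vector register.

```python
# Functional emulation of the vindexmac semantics in the Figure 4 caption.
# Operand names and the register-file model are illustrative assumptions.

def vindexmac(vrf, vd, vs2, rs_value):
    """vrf[vd][i] += vrf[vs2][0] * vrf[rs_value][i] for every element i."""
    selected = vrf[rs_value]  # indirect read: scalar content picks a vector register
    scalar = vrf[vs2][0]      # least significant element of the other source
    vrf[vd] = [acc + scalar * x for acc, x in zip(vrf[vd], selected)]


# Toy register file: 8 vector registers with 4 elements each.
vrf = [[0] * 4 for _ in range(8)]
vrf[4] = [1, 2, 3, 4]      # assumed: a row of B preloaded in v4
vrf[5] = [10, 20, 30, 40]  # assumed: another row of B preloaded in v5

vrf[1] = [3, 0, 0, 0]      # non-zero value of A in element 0 of v1
vindexmac(vrf, vd=0, vs2=1, rs_value=5)  # v0 += 3 * v5

vrf[1] = [2, 0, 0, 0]
vindexmac(vrf, vd=0, vs2=1, rs_value=4)  # v0 += 2 * v4
```

Because the scalar operand selects which preloaded register is read, the same instruction can serve any non-zero position without reloading the corresponding row from memory, which is consistent with the reduction in memory traffic reported in the summary.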