LOw-cOst yet High-Performant Sparse Matrix-Matrix Multiplication on Arm SME Architectures

Kelun Lei, Hailong Yang, Kaige Zhang, Kejie Ma, Yiqing Wang, Xin You, Yufan Xu, Enrique S. Quintana-Orti, Zhongzhi Luan, Yi Liu, Depei Qian

TL;DR

This work tackles SpMM performance on Armv9 CPUs equipped with the Scalable Matrix Extension (SME) by introducing LOOPS, a hybrid framework that pairs a row-wise CSR part with a vector-wise BCSR part so that NEON and SME can be used cooperatively. LOOPS employs a two-level parallelization scheme and a lightweight performance model to partition work adaptively between the vector and matrix units, achieving substantial speedups over CPU baselines and better energy efficiency than GPU implementations. Key contributions include the LOOPS hybrid format, FP64/FP32/FP16 support, an adaptive scheduler, and end-to-end validation on SuiteSparse and GNN workloads. The results demonstrate that exploiting SME's tile-based matrix parallelism alongside NEON yields high throughput with improved energy efficiency, making LOOPS particularly attractive for sparse workloads on energy-constrained Arm CPUs and edge deployments.

Abstract

Sparse matrix-dense matrix multiplication (SpMM) is a critical kernel in both scientific computing and emerging graph learning workloads. The recent Armv9 architecture introduces the Scalable Matrix Extension (SME), enabling tile-based matrix operations with high throughput. However, effectively exploiting both SME and traditional SIMD resources for unstructured sparse workloads remains an open challenge. To address this, we propose LOOPS, a hybrid execution framework that combines a row-wise CSR-part with a vector-wise BCSR-part layout, enabling cooperative utilization of vector instructions (NEON) and SME resources. LOOPS supports multi-precision SpMM across FP64, FP32, and FP16 via an adaptive two-level parallelization scheme guided by a lightweight performance model. Experimental results on the entire SuiteSparse collection on an Apple M4Pro CPU show that LOOPS achieves average speedups of 9.93$\times$ (FP32)/14.4$\times$ (FP64) against the CPU baseline TACO and 71.3$\times$ (FP32)/54.8$\times$ (FP64) with respect to Armadillo. A comparison of LOOPS running on the same CPU with two GPU methods (cuSPARSE, Magicube) executed on an NVIDIA A100 GPU shows average speedups for LOOPS between 19.8$\times$ and 33.5$\times$, depending on the precision. Notably, LOOPS delivers significantly better energy efficiency than the GPU codes on the A100 GPU.
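To make the hybrid layout described in the abstract more concrete, the following is a minimal pure-Python sketch of an SpMM that sends the leading rows through a row-wise CSR path and the remaining rows through a path built from column-vector blocks. It is an illustration under stated assumptions, not the LOOPS implementation: the function names, the single `split_row` boundary, the block `height`, and the idea that each `height x 1` block drives a rank-1 (outer-product style) update are all hypothetical choices made here, and the code uses neither NEON nor SME.

```python
# Illustrative sketch only. The split rule (one row boundary), the block
# height, and all names below are assumptions, not taken from the paper.

def csr_part_spmm(indptr, indices, data, B, row_end, C):
    """Row-wise CSR path for rows [0, row_end): a stand-in for the vector unit."""
    n = len(B[0])
    for i in range(row_end):
        for p in range(indptr[i], indptr[i + 1]):
            a, k = data[p], indices[p]
            for j in range(n):
                C[i][j] += a * B[k][j]

def build_vector_blocks(indptr, indices, data, row_start, height):
    """Group rows [row_start, m) into panels of `height` rows. Inside a panel,
    every column holding a nonzero becomes one zero-padded height x 1 block."""
    m = len(indptr) - 1
    panels = []
    for r0 in range(row_start, m, height):
        cols = {}
        for i in range(r0, min(r0 + height, m)):
            for p in range(indptr[i], indptr[i + 1]):
                cols.setdefault(indices[p], [0.0] * height)[i - r0] = data[p]
        panels.append((r0, sorted(cols.items())))
    return panels

def bcsr_part_spmm(panels, B, C):
    """Block path: each column-vector block times row k of B is a rank-1
    update of a panel of C, a stand-in for tile accumulation."""
    m, n = len(C), len(B[0])
    for r0, blocks in panels:
        for k, vec in blocks:
            for di, a in enumerate(vec):
                if r0 + di < m:  # skip padding rows past the matrix end
                    for j in range(n):
                        C[r0 + di][j] += a * B[k][j]

def hybrid_spmm(indptr, indices, data, B, split_row, height=4):
    """Compute C = A @ B with rows < split_row on the CSR path, the rest blocked."""
    m, n = len(indptr) - 1, len(B[0])
    C = [[0.0] * n for _ in range(m)]
    csr_part_spmm(indptr, indices, data, B, split_row, C)
    panels = build_vector_blocks(indptr, indices, data, split_row, height)
    bcsr_part_spmm(panels, B, C)
    return C

# Usage: a 5 x 4 sparse matrix in CSR form times a dense 4 x 2 matrix.
indptr = [0, 2, 3, 4, 6, 7]
indices = [0, 3, 2, 1, 0, 3, 2]
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
C = hybrid_spmm(indptr, indices, data, B, split_row=2, height=2)
```

In this sketch the result does not depend on where `split_row` falls, so the boundary only affects how the work is divided between the two paths. Choosing that boundary is the kind of decision the abstract attributes to the performance model.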

Paper Structure

This paper contains 23 sections, 3 equations, 6 figures, 4 tables, 3 algorithms.

Figures (6)

  • Figure 1: The overview of our SpMM pipeline.
  • Figure 2: The BCSR-part SpMM workflow.
  • Figure 3: Illustration of the FP16 two-way fmopa instruction.
  • Figure 4: Overall SpMM performance (in GFLOPS) of CPU- (on Apple M4Pro) and GPU SpMM methods (on NVIDIA A100) in (a) FP64 and (b) FP32 precisions.
  • Figure 5: Overall SpMM performance (in GFLOPS) comparison of our method (on Apple M4Pro) with GPU SpMM methods (on NVIDIA A100) in FP16 precision.
  • ...and 1 more figure