SpChar: Characterizing the Sparse Puzzle via Decision Trees

Francesco Sgherzi, Marco Siracusa, Ivan Fernandez, Adrià Armejach, Miquel Moretó

TL;DR

SpChar presents a holistic workload characterization framework for sparse computation that jointly considers input structure, sparse algorithms, and hardware. It uses decision-tree regressors to extract the most impactful architectural and input features from static matrix metrics and performance counters across three Arm platforms for SpMV, SpGEMM, and SpADD. Key findings identify memory-system latency, branch misprediction overhead, and poor cache reuse as primary bottlenecks, with concrete optimizations such as memory-interface specialization, larger caches, and prefetching yielding substantial speedups in simulated deployments. The methodology enables a practical characterization loop to guide hardware-software co-design for sparse workloads, with broad applicability across kernels and inputs.

Abstract

Sparse matrix computation is crucial in various modern applications, including large-scale graph analytics, deep learning, and recommender systems. The performance of sparse kernels varies greatly depending on the structure of the input matrix, making it difficult to gain a comprehensive understanding of sparse computation and its relationship to inputs, algorithms, and target machine architecture. Despite extensive research on certain sparse kernels, such as Sparse Matrix-Vector Multiplication (SpMV), the overall family of sparse algorithms has yet to be investigated as a whole. This paper introduces SpChar, a workload characterization methodology for general sparse computation. SpChar employs tree-based models to identify the most relevant hardware and input characteristics, starting from hardware and input-related metrics gathered from Performance Monitoring Counters (PMCs) and matrices. Our analysis enables the creation of a characterization loop that facilitates the optimization of sparse computation by mapping the impact of architectural features to inputs and algorithmic choices. We apply SpChar to more than 600 matrices from the SuiteSparse Matrix collection and three state-of-the-art Arm CPUs to determine the critical hardware and software characteristics that affect sparse computation. In our analysis, we determine that the biggest limiting factors for high-performance sparse computation are (1) the latency of the memory system, (2) the pipeline flush overhead resulting from branch misprediction, and (3) the poor reuse of cached elements. Additionally, we propose software and hardware optimizations that designers can implement to create a platform suitable for sparse computation. We then investigate these optimizations using the gem5 simulator to achieve a significant speedup of up to 2.63x compared to a CPU where the optimizations are not applied.
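As a rough illustration of how a tree-based model can rank hardware and input metrics, the sketch below scores each feature by the variance reduction of its best single threshold split, which is the criterion a regression tree evaluates at each node. This is a depth-1 stand-in, not the authors' model: a full tree would grow many splits and accumulate their gains. The feature names and numbers are hypothetical toy data, not measurements from the paper.

```python
from statistics import pvariance


def best_split_gain(xs, ys):
    """Largest normalized variance reduction from one threshold split on xs."""
    n = len(ys)
    total = pvariance(ys) * n  # total sum of squared deviations of the target
    if total == 0:
        return 0.0
    pairs = sorted(zip(xs, ys))
    best = 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal feature values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        sse = pvariance(left) * len(left) + pvariance(right) * len(right)
        best = max(best, total - sse)
    return best / total


def rank_features(samples, target):
    """Return (feature name, score) pairs sorted from most to least impactful."""
    scores = {name: best_split_gain(col, target) for name, col in samples.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: one value per (matrix, run) observation.
samples = {
    "l2_miss_rate": [0.10, 0.15, 0.40, 0.45, 0.80, 0.85],
    "branch_entropy": [0.5, 0.1, 0.4, 0.2, 0.3, 0.6],
}
runtime_ms = [1.0, 1.1, 2.0, 2.1, 4.0, 4.2]

ranking = rank_features(samples, runtime_ms)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy set the runtime tracks the hypothetical cache-miss metric closely, so that feature ranks first. With real data the inputs would be the PMC readings and static matrix metrics the abstract describes.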

Paper Structure

This paper contains 28 sections, 3 equations, 19 figures, 7 tables, and 3 algorithms.

Figures (19)

  • Figure 1: Row-wise partitioning scheme on 3 threads. Each thread operates on the contiguous set of rows displayed on the left, which translates to the influence regions on the right. Note that partitions of the row_ptrs array overlap because each thread needs the index of the last element the previous thread operated on.
  • Figure 2: Four of our generation methods. We can generate matrices that follow several distributions of nonzeros and that stress specific architectural features.
  • Figure 3: Temporal locality, Spatial locality, Branch entropy and Thread imbalance for our 9 matrix categories.
  • Figure 4: Thread imbalance on two different matrices. (a) and (b) depict the sparse matrices as adjacency matrices. atmosmodd (SuiteSparse) exhibits a more consistent structure than std1_Jac2 (SuiteSparse), which gives it orders of magnitude lower thread imbalance. We omit the value of thread imbalance on two threads for atmosmodd since it is $0$.
  • Figure 5: of $10$-fold cross-validation applied to each .
  • ...and 14 more figures
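
The row-wise partitioning described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, assuming a CSR layout and using SpMV as the example kernel; the helper names `partition_rows` and `spmv_rows` are hypothetical, and the "threads" are simulated with a sequential loop. A thread that owns rows `start..end-1` reads `row_ptrs[start..end]`, so neighbouring threads share one `row_ptrs` entry, which is the overlap the caption mentions.

```python
def partition_rows(n_rows, n_threads):
    """Split rows into contiguous, near-equal chunks of (start, end) bounds."""
    base, extra = divmod(n_rows, n_threads)
    bounds, start = [], 0
    for t in range(n_threads):
        end = start + base + (1 if t < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def spmv_rows(row_ptrs, col_idx, vals, x, start, end):
    """Compute y[start:end] of y = A @ x for a CSR matrix A."""
    out = []
    for r in range(start, end):
        acc = 0.0
        # row_ptrs[r] is also where the previous row (possibly owned by the
        # previous thread) stopped, hence the one-entry overlap.
        for k in range(row_ptrs[r], row_ptrs[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        out.append(acc)
    return out


# 4x4 example matrix (row 2 is empty):
# [1 0 2 0]
# [0 3 0 0]
# [0 0 0 0]
# [4 0 0 5]
row_ptrs = [0, 2, 3, 3, 5]
col_idx = [0, 2, 1, 0, 3]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0, 4.0]

bounds = partition_rows(n_rows=4, n_threads=3)
# Indices of row_ptrs each simulated thread reads (inclusive of `end`).
ptr_ranges = [list(range(s, e + 1)) for s, e in bounds]

y = []
for s, e in bounds:  # each iteration stands in for one thread
    y.extend(spmv_rows(row_ptrs, col_idx, vals, x, s, e))
```

Because the split is by row count rather than by nonzero count, a thread that receives dense rows does more work than one that receives sparse or empty rows.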