MACKO: Sparse Matrix-Vector Multiplication for Low Sparsity

Vladimír Macko, Vladimír Boža

TL;DR

This work tackles the memory and compute cost of SpMV under unstructured sparsity in pruned LLMs by introducing MACKO-SpMV, which pairs a mutually aligned compressed coordinates storage format with padding and a SplitK-based GPU kernel. At 50% sparsity it reports a 1.5x memory reduction and a 1.2-1.5x speedup over the dense representation, along with speedups over the cuSPARSE, Sputnik, and DASP baselines; the format targets the 30–90% sparsity range typical of pruned LLMs. These gains carry over to end-to-end LLM inference on consumer GPUs (e.g., Llama2-7B pruned with Wanda to 50% sparsity runs 1.5x faster at fp16). Key contributions are the MACKO storage format, a GPU-optimized SpMV kernel, and an evaluation across multiple GPUs and models. The results indicate that unstructured pruning at moderate sparsity is practically viable for real-world workloads, making pruned LLMs easier to deploy on commodity hardware.

Abstract

Sparse Matrix-Vector Multiplication (SpMV) is a fundamental operation in the inference of sparse Large Language Models (LLMs). Because existing SpMV methods perform poorly under the low and unstructured sparsity (30–90%) commonly observed in pruned LLMs, unstructured pruning has provided only limited memory reduction and speedup. We propose MACKO-SpMV, a GPU-optimized format and kernel co-designed to reduce storage overhead while preserving compatibility with the GPU's execution model. This enables efficient SpMV for unstructured sparsity without specialized hardware units (e.g., tensor cores) or format-specific precomputation. Empirical results show that at 50% sparsity, MACKO is the first approach to deliver a significant 1.5x memory reduction and a 1.2-1.5x speedup over the dense representation. Speedups over other SpMV baselines are 2.8-13.0x over cuSPARSE, 1.9-2.6x over Sputnik, and 2.2-2.5x over DASP. Applied to Llama2-7B pruned with Wanda to 50% sparsity, it delivers a 1.5x memory reduction and 1.5x faster inference at fp16 precision. Thanks to MACKO, unstructured pruning at 50% sparsity is now justified in real-world LLM workloads.
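
To make the idea of a compressed coordinates format with padding more concrete, here is a minimal CPU-side sketch in Python. It is not the authors' implementation and ignores the GPU kernel, row pointers, and memory alignment entirely. It assumes that each stored entry carries a value plus a small delta counting the zeros skipped since the previous stored entry, and that a gap too large for the delta field is bridged by storing an explicit zero as padding. The names `encode_row`, `sparse_dot`, and `effective_density` are hypothetical, and the effective density formula (stored bits divided by dense bits) is an assumed definition.

```python
def encode_row(row, delta_bits=2):
    """Delta-encode one dense row into (values, deltas).

    Assumption: delta = number of zeros skipped since the previous
    stored entry, so it must fit in `delta_bits` bits. Longer gaps
    are bridged by storing explicit zero values (padding).
    """
    max_delta = (1 << delta_bits) - 1
    values, deltas = [], []
    prev = -1  # column of the last stored entry
    for col, v in enumerate(row):
        if v == 0:
            continue
        # Gap does not fit in the delta field: emit padding zeros.
        while col - prev - 1 > max_delta:
            prev += max_delta + 1
            values.append(0.0)
            deltas.append(max_delta)
        values.append(v)
        deltas.append(col - prev - 1)
        prev = col
    return values, deltas


def sparse_dot(values, deltas, x):
    """Dot product of one encoded row with a dense vector x."""
    col, acc = -1, 0.0
    for v, d in zip(values, deltas):
        col += d + 1  # recover the absolute column index
        acc += v * x[col]
    return acc


def effective_density(n_stored, n_cols, val_bits=16, delta_bits=2):
    """Stored bits relative to a dense row (assumed definition)."""
    return n_stored * (val_bits + delta_bits) / (n_cols * val_bits)


row = [0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 2.0, 0.0, 0.0, 0.0]
x = [float(i) for i in range(len(row))]

values, deltas = encode_row(row, delta_bits=2)
print(values)   # [5.0, 0.0, 3.0, 2.0]  (one padding zero inserted)
print(deltas)   # [1, 3, 1, 0]
print(sparse_dot(values, deltas, x))             # 42.0
print(effective_density(len(values), len(row)))  # 0.375
```

In this toy row the true density is 0.25, yet one padding entry raises the stored count from 3 to 4. Under the same assumed definition, a CSR-like layout with 16-bit values and 16-bit column indices would need 3 × 32 bits, or an effective density of 0.5, which is the kind of overhead that shorter deltas are meant to reduce.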

Paper Structure

This paper contains 28 sections, 10 equations, 9 figures, 1 table, 1 algorithm.

Figures (9)

  • Figure 1: Sparse matrix–vector multiplication runtime. Matrix size $36864\times12288$ in fp16 on an NVIDIA GeForce RTX 4090 GPU. Using MACKO, sparse computation exceeds the performance of dense at sparsity as low as 25%. Existing libraries require 2.6× fewer non-zeros to achieve the same performance. The improvement of this work is displayed as the highlighted region.
  • Figure 2: Effective density of different sparse matrix formats for 16-bit values. Q8 and Q4 refer to quantization to 8 and 4 bits, respectively. Expected effective density is shown for MACKO.
  • Figure 3: Example of the MACKO storage format with 16-bit values and 2-bit deltas ($b_{val}=16$, $b_\Delta=2$) compared to CSR16.
  • Figure 4: Relative speedup over cuBLAS across different matrix sizes.
  • Figure 5: Speedup of MACKO relative to cuBLAS on RTX 4090 for $12288\times12288$, cuBLAS runtime 336 $\mu s$.
  • ...and 4 more figures