MACKO: Sparse Matrix-Vector Multiplication for Low Sparsity
Vladimír Macko, Vladimír Boža
TL;DR
This work tackles the memory and compute challenges of SpMV for unstructured sparsity in pruned LLMs by introducing MACKO-SpMV, which pairs a mutually aligned compressed coordinates storage format (with padding) with a SplitK-based GPU kernel. At 50% sparsity the approach achieves a 1.5x memory reduction and a 1.2-1.5x speedup over dense representations, and it outperforms state-of-the-art SpMV baselines (cuSPARSE, Sputnik, DASP) across the 30-90% sparsity range. These gains carry over to end-to-end LLM inference on consumer GPUs (e.g., Llama2-7B pruned with Wanda to 50% sparsity). Key contributions are the MACKO storage format, a GPU-optimized SpMV kernel, and an extensive evaluation across multiple GPUs and models. The results indicate that unstructured pruning at moderate sparsity is practically viable for real-world workloads, making pruned LLMs easier to deploy on commodity hardware.
Abstract
Sparse Matrix-Vector Multiplication (SpMV) is a fundamental operation in the inference of sparse Large Language Models (LLMs). Because existing SpMV methods perform poorly under the low and unstructured sparsity (30-90%) commonly observed in pruned LLMs, unstructured pruning has so far provided only limited memory reduction and speedup. We propose MACKO-SpMV, a GPU-optimized format and kernel co-designed to reduce storage overhead while preserving compatibility with the GPU's execution model. This enables efficient SpMV for unstructured sparsity without specialized hardware units (e.g., tensor cores) or format-specific precomputation. Empirical results show that at 50% sparsity, MACKO is the first approach to deliver both a significant memory reduction (1.5x) and a speedup (1.2-1.5x) over the dense representation. It is also faster than other SpMV baselines: 2.8-13.0x over cuSPARSE, 1.9-2.6x over Sputnik, and 2.2-2.5x over DASP. Applied to Llama2-7B pruned with Wanda to 50% sparsity, it delivers a 1.5x memory reduction and 1.5x faster inference at fp16 precision. Thanks to MACKO, unstructured pruning at 50% sparsity is now justified in real-world LLM workloads.
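To make the storage-overhead argument concrete, the following is a minimal back-of-the-envelope sketch in Python. It does not describe the MACKO format itself. It compares dense fp16 storage with a textbook CSR layout, assuming 32-bit column indices and row pointers (an assumption, not a figure from the paper), and then derives the average non-value overhead per nonzero implied by the reported 1.5x memory reduction at 50% sparsity. The 4096 x 4096 matrix shape and all function names are hypothetical.

```python
def dense_bits(rows, cols, value_bits=16):
    """Bits needed to store every entry of a dense rows x cols matrix."""
    return rows * cols * value_bits


def csr_bits(rows, cols, density, value_bits=16, index_bits=32):
    """Bits for a textbook CSR layout: one value and one column index per
    nonzero, plus rows + 1 row pointers. The index width is an assumption."""
    nnz = round(rows * cols * density)
    return nnz * (value_bits + index_bits) + (rows + 1) * index_bits


def csr_break_even_density(value_bits=16, index_bits=32):
    """Density below which CSR is smaller than dense (row pointers ignored)."""
    return value_bits / (value_bits + index_bits)


def implied_overhead_bits(reduction, density, value_bits=16):
    """Average non-value bits per nonzero implied by a given memory
    reduction factor relative to dense storage at the given density."""
    return value_bits / (reduction * density) - value_bits


if __name__ == "__main__":
    rows = cols = 4096   # hypothetical layer shape
    density = 0.5        # 50% sparsity means half the entries are nonzero
    ratio = dense_bits(rows, cols) / csr_bits(rows, cols, density)
    print(f"dense / CSR size at 50% sparsity: {ratio:.3f}")
    print(f"CSR break-even density: {csr_break_even_density():.3f}")
    print(f"overhead implied by 1.5x at 50%: "
          f"{implied_overhead_bits(1.5, 0.5):.2f} bits per nonzero")
```

Under these assumptions, CSR at 50% sparsity is roughly 1.5x larger than the dense matrix and only breaks even at about 67% sparsity. A 1.5x reduction at 50% sparsity, by contrast, leaves room for only about 5.3 bits of non-value storage per nonzero on average.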
