SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity

Hanyong Shao, Yingbo Hao, Ting Song, Yan Xia, Di Zhang, Shaohan Huang, Xun Wu, Songchen Xu, Le Xu, Li Dong, Zewen Chi, Yi Zou, Furu Wei

TL;DR

Sliding Window Decomposition reconstructs any $(2N-2):2N$ weight block into $N-1$ overlapping 2:4-compliant windows without any accuracy loss; Activation Lifting fuses the corresponding activation rearrangement into per-token quantization at near-zero cost.

Abstract

NVIDIA's 2:4 Sparse Tensor Cores deliver 2x throughput but demand strict 50% pruning, a ratio that collapses LLM reasoning accuracy (Qwen3: 54% to 15%). Milder $(2N-2):2N$ patterns (e.g., 6:8, 25% pruning) preserve accuracy yet receive no hardware support, so they fall back to dense execution and gain nothing from sparsity. We present SlideSparse, the first system to unlock Sparse Tensor Core acceleration for the $(2N-2):2N$ model family on commodity GPUs. Our Sliding Window Decomposition reconstructs any $(2N-2):2N$ weight block into $N-1$ overlapping 2:4-compliant windows without any accuracy loss; Activation Lifting fuses the corresponding activation rearrangement into per-token quantization at near-zero cost. Integrated into vLLM, SlideSparse is evaluated across various GPUs (A100, H100, B200, RTX 4090, RTX 5080, DGX Spark), precisions (FP4, INT8, FP8, BF16, FP16), and model families (Llama, Qwen, BitNet). On compute-bound workloads, the measured speedup (1.33x) approaches the theoretical upper bound $N/(N-1)=4/3$ at 6:8 weight sparsity in Qwen2.5-7B, establishing $(2N-2):2N$ as a practical path to accuracy-preserving LLM acceleration. Code is available at https://github.com/bcacdwk/vllmbench.
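
The upper bound $N/(N-1)$ quoted in the abstract can be recovered with a short back-of-the-envelope argument. The sketch below assumes (these are our simplifications, not statements from the paper) that each block of $2N$ weights is expanded into $N-1$ windows of 4 entries, that the sparse GEMM runs at the ideal 2x dense throughput, and that the cost of the activation rearrangement is negligible. The symbols $T_{\text{dense}}$ and $T_{\text{slide}}$ are introduced here for illustration only.

$$
\gamma = \frac{4(N-1)}{2N} = \frac{2(N-1)}{N},
\qquad
\frac{T_{\text{slide}}}{T_{\text{dense}}} = \frac{\gamma}{2} = \frac{N-1}{N},
\qquad
S_{\max} = \frac{T_{\text{dense}}}{T_{\text{slide}}} = \frac{N}{N-1}.
$$

For 6:8 sparsity ($N=4$) this gives $\gamma = 1.5$ and $S_{\max} = 4/3 \approx 1.33$, which matches the measured 1.33x figure above.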

Paper Structure

This paper contains 132 sections, 16 equations, 10 figures, 2 tables, and 2 algorithms.

Figures (10)

  • Figure 1: SlideSparse extends 2:4 Sparse Tensor Cores to the $\mathbf{(2N{-}2):2N}$ sparsity family. (a) SlideSparse transforms 6:8 weights into 2:4-compliant blocks, enabling sparsity acceleration. (b) End-to-end speedup on A100 (INT8, seq_len$=$8K) approaches the theoretical limit $S_{\max}=N/(N{-}1)=3/2, 4/3, 5/4, \ldots$ for $N=3, 4, 5, \ldots$ (see the Method section).
  • Figure 2: Reasoning accuracy of Qwen3 (Yang et al., 2025) under different sparsity patterns. 6:8 preserves near-dense performance (51.6% vs. 54.0%); 2:4 collapses to 15.3%.
  • Figure 3: Two-dimensional compression space for LLM acceleration. X-axis: quantization precision (16-bit to 1.58-bit BitNet, up to $8\times$ speedup). Y-axis: sparsity (dense to 2:4, up to $2\times$ speedup). Gray dots mark existing hardware support, which is limited to the dense and 2:4 extremes. Green dots show $(2N{-}2):2N$ patterns that SlideSparse enables, filling the Acceleration Gap and unlocking fine-grained sparsity-precision trade-offs.
  • Figure 4: Sliding window decomposition for 6:8 sparsity. Three stride-2 windows (each of size 4) cover all 8 positions. Overlap regions allow non-zeros to spill into the next window when the current one reaches capacity, converting any $(2N{-}2):2N$ pattern into concatenated 2:4 blocks for Sparse Tensor Core acceleration.
  • Figure 5: SlideSparse system overview. Offline: weight preprocessing transforms $(2N{-}2):2N$ sparse weights into slided format with $\gamma\times$ expansion. Initialization: cuSPARSELt compresses weights into 2:4 format at model load time. Online: per-request inference executes a fused quantization-slide kernel followed by a sparse GEMM.
  • ...and 5 more figures
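
The decomposition described in the Figure 4 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `slide_decompose` and `lift_activations` are hypothetical, the left-to-right greedy fill is our assumption about how non-zeros spill between overlapping windows, and the fusion of the activation rearrangement into per-token quantization is not modeled (the rearrangement is shown as a plain gather).

```python
import numpy as np

def slide_decompose(block, n):
    """Hypothetical helper: split one (2N-2):2N block (length 2N, at most
    2N-2 non-zeros) into N-1 stride-2 windows of size 4, each holding at
    most 2 non-zeros, and return them concatenated (length 4*(N-1))."""
    block = np.asarray(block)
    assert block.shape == (2 * n,)
    assert np.count_nonzero(block) <= 2 * n - 2
    windows = np.zeros((n - 1, 4), dtype=block.dtype)
    taken = np.zeros(2 * n, dtype=bool)  # positions already placed in a window
    for i in range(n - 1):
        kept = 0
        for off in range(4):
            p = 2 * i + off  # window i covers positions 2i .. 2i+3
            if block[p] != 0 and not taken[p] and kept < 2:
                windows[i, off] = block[p]
                taken[p] = True
                kept += 1
            # Non-zeros skipped here lie in the overlap and are picked up
            # by the next window.
    assert taken[block != 0].all()  # every non-zero was placed exactly once
    return windows.reshape(-1)

def lift_activations(x, n):
    """Hypothetical helper: gather activations so that they line up with
    the concatenated windows (overlapping positions are duplicated)."""
    idx = (2 * np.arange(n - 1)[:, None] + np.arange(4)[None, :]).reshape(-1)
    return np.asarray(x)[..., idx]

# 6:8 example (N = 4): 6 non-zeros in 8 positions.
w = np.array([3, 0, -1, 2, 5, 0, 4, 7])
x = np.arange(1, 9)
w_slid = slide_decompose(w, 4)
x_lift = lift_activations(x, 4)

# Each group of 4 is 2:4-compliant, and the dot product is unchanged.
assert all(np.count_nonzero(w_slid[i:i + 4]) <= 2 for i in range(0, 12, 4))
assert w_slid @ x_lift == w @ x
```

Because each non-zero weight lands in exactly one window while the matching activation is duplicated wherever windows overlap, the expanded dot product equals the original one exactly, which is the sense in which the decomposition is lossless in this sketch.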