SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity

Hanyong Shao; Yingbo Hao; Ting Song; Yan Xia; Di Zhang; Shaohan Huang; Xun Wu; Songchen Xu; Le Xu; Li Dong; Zewen Chi; Yi Zou; Furu Wei

TL;DR

Sliding Window Decomposition reconstructs any $(2N-2):2N$ weight block into $N-1$ overlapping 2:4-compliant windows without any accuracy loss; Activation Lifting fuses the corresponding activation rearrangement into per-token quantization at near-zero cost.

Abstract

NVIDIA's 2:4 Sparse Tensor Cores deliver 2x throughput but demand strict 50% pruning -- a ratio that collapses LLM reasoning accuracy (Qwen3: 54% to 15%). Milder $(2N-2):2N$ patterns (e.g., 6:8, 25% pruning) preserve accuracy yet receive no hardware support, so they fall back to dense execution and gain nothing from sparsity. We present SlideSparse, the first system to unlock Sparse Tensor Core acceleration for the $(2N-2):2N$ model family on commodity GPUs. Our Sliding Window Decomposition reconstructs any $(2N-2):2N$ weight block into $N-1$ overlapping 2:4-compliant windows without any accuracy loss; Activation Lifting fuses the corresponding activation rearrangement into per-token quantization at near-zero cost. Integrated into vLLM, SlideSparse is evaluated across GPUs (A100, H100, B200, RTX 4090, RTX 5080, DGX Spark), precisions (FP4, INT8, FP8, BF16, FP16), and model families (Llama, Qwen, BitNet). On compute-bound workloads, the measured speedup (1.33x) approaches the theoretical upper bound $N/(N-1)=4/3$ at 6:8 weight sparsity on Qwen2.5-7B, establishing $(2N-2):2N$ as a practical path to accuracy-preserving LLM acceleration. Code is available at https://github.com/bcacdwk/vllmbench.
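To make the decomposition concrete, the following is a minimal NumPy sketch of how one $(2N-2):2N$ block could be split into $N-1$ stride-2 windows of size 4, together with the matching activation rearrangement. The function names `slide_decompose` and `lift_activation` are hypothetical, and the greedy "earliest window first" assignment is an assumption made for illustration; it is not taken from the SlideSparse code base, whose offline packer and fused kernel may work differently.

```python
import numpy as np

def slide_decompose(block):
    """Split a block of 2N weights (at most 2N-2 non-zeros) into N-1
    overlapping windows of size 4 with stride 2, each holding at most
    2 non-zeros. Returns the windows concatenated (length 4*(N-1)).
    Hypothetical helper: greedy assignment is an illustrative assumption."""
    block = np.asarray(block)
    two_n = block.size
    if two_n < 4 or two_n % 2:
        raise ValueError("block length must be an even number >= 4")
    if np.count_nonzero(block) > two_n - 2:
        raise ValueError("block is not (2N-2):2N sparse")
    n_windows = two_n // 2 - 1
    windows = np.zeros((n_windows, 4), dtype=block.dtype)
    used = np.zeros(n_windows, dtype=int)
    for pos in np.flatnonzero(block):      # positions in ascending order
        pair = pos // 2                    # window w covers pairs w and w+1
        for w in (pair - 1, pair):         # try the earlier window first
            if 0 <= w < n_windows and used[w] < 2:
                windows[w, pos - 2 * w] = block[pos]
                used[w] += 1
                break
        else:
            raise RuntimeError("no window with free capacity")
    return windows.reshape(-1)

def lift_activation(x):
    """Rearrange activations so that position k of window w lines up
    with x[2*w + k]; overlapping entries are simply duplicated."""
    x = np.asarray(x)
    n_windows = x.size // 2 - 1
    return np.concatenate([x[2 * w: 2 * w + 4] for w in range(n_windows)])

# 6:8 example (N = 4): three windows, 12 slided weights.
block = np.array([1, 2, 3, 0, 4, 5, 0, 6])
x = np.array([2, 3, 5, 7, 11, 13, 17, 19])
slid = slide_decompose(block)
# Every non-zero lands in exactly one window, so the dot product is unchanged.
assert slid @ lift_activation(x) == block @ x
```

Because each non-zero weight is copied into exactly one window and the lifted activation repeats the overlapping entries, the inner product over the slided layout equals the original one, which is the sense in which the decomposition is lossless in this sketch.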

Paper Structure (132 sections, 16 equations, 10 figures, 2 tables, 2 algorithms)


Introduction
Motivation
2:4 Sparsity: Fast but Too Aggressive
The Deployment Gap
Our Approach: Computational Arbitrage
SlideSparse Method
Problem: The Sparsity Mismatch
Solution: Sliding Window Decomposition
Activation Lifting
Cost Analysis: When Does SlideSparse Pay Off?
SlideSparse System Implementation
Offline Weight Packer
Fused Quantization-Slide Kernel
System Integration
Experiments
...and 117 more sections

Figures (10)

Figure 1: SlideSparse extends 2:4 Sparse Tensor Cores to the $\mathbf{(2N{-}2):2N}$ sparsity family. (a) SlideSparse transforms 6:8 weights into 2:4-compliant blocks, enabling sparsity acceleration. (b) End-to-end speedup on A100 (INT8, seq_len$=$8K) approaches the theoretical limit $S_{\max}=N/(N{-}1)=3/2, 4/3, 5/4, ...$ (see the SlideSparse Method section).
Figure 2: Reasoning accuracy of Qwen3 (Yang et al., 2025) under different sparsity patterns. 6:8 preserves near-dense performance (51.6% vs. 54.0%); 2:4 collapses to 15.3%.
Figure 3: Two-dimensional compression space for LLM acceleration. X-axis: quantization precision (16-bit to 1.58-bit BitNet, up to $8\times$ speedup). Y-axis: sparsity (dense to 2:4, up to $2\times$ speedup). Gray dots mark existing hardware support, which is limited to the dense and 2:4 extremes. Green dots show $(2N{-}2):2N$ patterns that SlideSparse enables, filling the Acceleration Gap and unlocking fine-grained sparsity-precision trade-offs.
Figure 4: Sliding window decomposition for 6:8 sparsity. Three stride-2 windows (each of size 4) cover all 8 positions. Overlap regions allow non-zeros to spill into the next window when one reaches capacity, converting any $(2N{-}2):2N$ pattern into concatenated 2:4 blocks for Sparse Tensor Core acceleration.
Figure 5: SlideSparse system overview. Offline: Weight preprocessing transforms $(2N{-}2):2N$ sparse weights into slided format with $\gamma\times$ expansion. Initialization: cuSPARSELt compresses weights into 2:4 format at model load time. Online: Per-request inference executes a fused quantization-slide kernel followed by a sparse GEMM.
...and 5 more figures
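
The limit $S_{\max}$ quoted in Figure 1 can be recovered from a short count. This is a sketch under two assumptions not spelled out on this page: that $\gamma$ in Figure 5 denotes the ratio of slided to original weight length, and that a 2:4 sparse GEMM runs at exactly $2\times$ the dense throughput for the same problem size. With $N-1$ windows of 4 elements replacing a block of $2N$ elements:

$$
\gamma = \frac{4(N-1)}{2N} = \frac{2(N-1)}{N}, \qquad S_{\max} = \frac{2}{\gamma} = \frac{N}{N-1}.
$$

For 6:8 ($N=4$) this gives $\gamma = 3/2$ and $S_{\max} = 4/3 \approx 1.33$, in line with the speedup reported in the abstract.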
