ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization

Lawrence Liu, Alexander Liu, Mengdi Wang, Tuo Zhao, Lin F. Yang

TL;DR

The paper addresses the efficiency-accuracy gap in pruning large language models by focusing on semi-structured 2:4 sparsity. It introduces ARMOR, a one-shot post-training pruning method that factorizes each weight matrix as $\hat{W} = A \cdot (W' \odot M) \cdot B$ with a $2:4$ sparse core and learnable block-diagonal wrappers, enabling better fidelity while preserving hardware speedups. The optimization alternates continuous updates for $A,B,W'$ and discrete updates to the sparse core, guided by a NoWag-style proxy loss, with a convergence guarantee and initialization tied to NoWag-P. Empirically, ARMOR consistently outperforms existing $2:4$ pruning methods on Llama and Qwen models for perplexity and downstream tasks, while maintaining practical speedups and memory reductions. The work demonstrates that rethinking weight representations can yield superior accuracy–efficiency trade-offs for practical LLM deployment.
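The factorization $\hat{W} = A \cdot (W' \odot M) \cdot B$ can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the helper names (`two_four_mask`, `block_diag_identity`, `armor_reconstruct`), the toy dimensions, and the magnitude-based mask are all assumptions made here for illustration. In particular, the magnitude-based mask is a simple stand-in and is not the NoWag-P initialization the paper refers to.

```python
import numpy as np

def two_four_mask(w):
    """Binary 2:4 mask: keep the 2 largest-magnitude entries in every
    group of 4 consecutive entries along the input dimension.
    (Magnitude selection is an assumption made for this sketch.)"""
    d_out, d_in = w.shape
    assert d_in % 4 == 0, "input dimension must be divisible by 4"
    groups = np.abs(w).reshape(d_out, d_in // 4, 4)
    # argsort is ascending, so the last two indices are the two largest.
    top2 = np.argsort(groups, axis=-1)[..., 2:]
    mask = np.zeros_like(groups)
    np.put_along_axis(mask, top2, 1.0, axis=-1)
    return mask.reshape(d_out, d_in)

def block_diag_identity(d, block):
    """Block-diagonal d x d matrix whose diagonal blocks are identities.
    Stored densely here for clarity; a real kernel would store only the blocks."""
    assert d % block == 0, "dimension must be divisible by the block size"
    out = np.zeros((d, d))
    for i in range(0, d, block):
        out[i:i + block, i:i + block] = np.eye(block)
    return out

def armor_reconstruct(A, W_prime, M, B):
    """W_hat = A @ (W' * M) @ B, with '*' the elementwise (Hadamard) product."""
    return A @ (W_prime * M) @ B

rng = np.random.default_rng(0)
d_out, d_in, block = 8, 16, 4           # toy sizes, chosen for illustration
W = rng.standard_normal((d_out, d_in))

M = two_four_mask(W)                    # 2:4 binary mask, shape (d_out, d_in)
A = block_diag_identity(d_out, block)   # left wrapper,  shape (d_out, d_out)
B = block_diag_identity(d_in, block)    # right wrapper, shape (d_in, d_in)
W_hat = armor_reconstruct(A, W, M, B)
```

With identity wrappers and $W' = W$, the reconstruction reduces to the plainly masked matrix $W \odot M$, so this starting point is no worse than the masked baseline it is built from. Any improvement then has to come from optimizing $A$, $B$, $W'$, and $M$ afterwards.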

Abstract

Large language models (LLMs) present significant deployment challenges due to their immense computational and memory requirements. While semi-structured pruning, particularly 2:4 sparsity, offers a path to practical hardware acceleration, existing methods often incur substantial performance degradation. To bridge this gap, we introduce ARMOR (Adaptive Representation with Matrix-factORization), a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block diagonal matrices. These wrappers act as efficient pre- and post-transformation error correctors, offering greater flexibility to preserve model quality compared to conventional 2:4 pruning techniques. The sparse core and block diagonal wrappers are chosen through a block coordinate descent algorithm that minimizes a layer-wise proxy loss. We theoretically prove this optimization is guaranteed to converge to a solution with a proxy loss less than or equal to that of state-of-the-art pruning algorithms. Experiments on Llama (Touvron et al., 2023; Dubey et al., 2024) and Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations. ARMOR achieves this superior performance while retaining the inference speedups and substantial memory usage reductions of 2:4 pruning, establishing a more effective trade-off between model compression and task accuracy.

Paper Structure

This paper contains 42 sections, 5 theorems, 36 equations, 3 figures, 6 tables, 3 algorithms.

Key Result

Theorem 3.1

(Convergence of the ARMOR optimization algorithm). The sequence $\{\mathcal{L}_{W,X}((\theta)_{t})\}_{t\geq 0}$ converges and $\mathcal{L}_{W,X}((\theta)_{t})\leq\mathcal{L}_{W,X}((\theta)_{0}) \quad\forall\quad t>0.$
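A guarantee of this shape typically follows from a monotone-descent argument. The sketch below is a generic outline under two stated assumptions, and is not necessarily the paper's own proof: (i) every block update, continuous or discrete, does not increase the proxy loss, and (ii) the proxy loss is bounded below by zero.

```latex
% Assumption (i): each block coordinate update is non-increasing.
\mathcal{L}_{W,X}((\theta)_{t+1}) \;\leq\; \mathcal{L}_{W,X}((\theta)_{t})
  \qquad \forall\, t \geq 0 .

% Chaining the inequality from 0 up to t gives the stated bound:
\mathcal{L}_{W,X}((\theta)_{t}) \;\leq\; \mathcal{L}_{W,X}((\theta)_{0})
  \qquad \forall\, t > 0 .

% Assumption (ii): the proxy loss is bounded below.
\mathcal{L}_{W,X}((\theta)_{t}) \;\geq\; 0 .

% A non-increasing real sequence that is bounded below converges
% (monotone convergence theorem), so
\lim_{t\to\infty} \mathcal{L}_{W,X}((\theta)_{t}) \ \text{exists.}
```

Because $(\theta)_{0}$ is tied to the NoWag-P solution, the bound $\mathcal{L}_{W,X}((\theta)_{t})\leq\mathcal{L}_{W,X}((\theta)_{0})$ is what supports the abstract's claim that the final proxy loss is no worse than that of the baseline it starts from.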

Figures (3)

  • Figure 1: Illustration of proposed ARMOR factorization. For a given LLM, each weight matrix $W$ is pruned individually. Instead of naively pruning the weight matrix, ARMOR wraps the sparse core with a pair of block diagonal matrices and uses a unique optimization algorithm to find the optimal structured pruning mask. $M\in\{0,1\}^{d_{\mathrm{out}}\times d_{\mathrm{in}}}$ represents the 2:4 binary mask.
  • Figure 2: An illustration of the sparse core update step of the ARMOR optimization algorithm
  • Figure 3: Left: Relative average Proxy Loss and C4 Perplexity of Llama-2 7B across 20,000 iterations of the ARMOR Proxy Loss optimization algorithm with block size 128. Right: Relative C4 Perplexity for Llama-2 7B/13B and Llama-3 8B across block sizes of 1, 8, 16, 32, 64, and 128. Each block size was only optimized for 5,000 iterations due to time constraints. Relative perplexity is with respect to initial and optimal (dense) perplexities.

Theorems & Definitions (9)

  • Theorem 3.1
  • Proposition 1
  • proof
  • Lemma C.0
  • proof
  • Lemma C.0
  • proof
  • Theorem C.1
  • proof