Deterministic Differentiable Structured Pruning for Large Language Models

Weiyu Huang, Pengle Zhang, Xiaolu Zhang, Jun Zhou, Jun Zhu, Jianfei Chen

TL;DR

The paper proposes Deterministic Differentiable Pruning (DDP), a mask-only optimization method that eliminates stochasticity by directly optimizing a deterministic soft surrogate of the discrete $\ell_0$ objective, offering greater expressiveness, reduced train-test mismatch, and faster convergence.

Abstract

Structured pruning reduces LLM inference cost by removing low-importance architectural components. This can be viewed as learning a multiplicative gate for each component under an $\ell_0$ sparsity constraint. Due to the discreteness of the $\ell_0$ norm, prior work typically adopts stochastic hard-concrete relaxations to enable differentiable optimization; however, this stochasticity can introduce a train-test mismatch when sampled masks are discretized for deployment and restricts masks to a bounded, near-binary range. To address this, we propose Deterministic Differentiable Pruning (DDP), a mask-only optimization method that eliminates stochasticity by directly optimizing a deterministic soft surrogate of the discrete $\ell_0$ objective. Compared with prior approaches, DDP offers greater expressiveness, reduced train-test mismatch, and faster convergence. We apply our method to several dense and MoE models, including Qwen3-32B and Qwen3-30B-A3B, achieving a performance loss as small as 1% on downstream tasks while outperforming previous methods at 20% sparsity. We further demonstrate end-to-end inference speedups in realistic deployment settings with vLLM.
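
The gating view in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`apply_gates`, `l0_norm`, `sparsity`) and the toy numbers are hypothetical, chosen only to show what "a multiplicative gate for each component under an $\ell_0$ sparsity constraint" means for five components, one of which is pruned.

```python
# Illustrative only: all names and values here are assumptions, not from the paper.

def apply_gates(component_outputs, gates):
    """Scale each component's output by its gate; a gate of 0 removes the component."""
    return [g * y for g, y in zip(gates, component_outputs)]


def l0_norm(gates):
    """Number of nonzero gates, i.e. the number of components that are kept."""
    return sum(1 for g in gates if g != 0.0)


def sparsity(gates):
    """Fraction of components removed."""
    return 1.0 - l0_norm(gates) / len(gates)


# Toy stand-ins for the outputs of five attention heads (or MLP channels).
head_outputs = [0.8, -1.2, 0.3, 2.0, -0.5]
# One gate per component; the second component is pruned.
gates = [1.0, 0.0, 1.0, 1.0, 1.0]

masked = apply_gates(head_outputs, gates)
print("kept components:", l0_norm(gates))
print("sparsity:", sparsity(gates))
print("masked outputs:", masked)
```

Because `l0_norm` is a count, it is piecewise constant in the gate values and gives no useful gradient, which is why the methods discussed in the abstract optimize a relaxation instead.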

Paper Structure

This paper contains 52 sections, 5 theorems, 59 equations, 6 figures, 11 tables, and 1 algorithm.

Key Result

Theorem 3.1

Consider the staged optimization of eq:lp_total with annealing $\mu \downarrow 0$. Under certain assumptions, for each stage $r = 1, 2, 3, \dots$ (with fixed $\mu_r$), the deterministic augmented-Lagrangian updates generate iterates whose accumulation points are (i) feasible for the relaxed equality constrai…

Figures (6)

  • Figure 1: Deterministic Differentiable Pruning overview. Left: Masked formulation for dense and MoE models. For dense models, we prune attention heads and MLP channels; for MoE models, we prune expert channels only. Right: Mask-only optimization with decoupled forward masks and retention scores for regularization, enabling deterministic training and an expanded mask range.
  • Figure 2: Deterministic surrogate mapping in DDP. Annealing $\mu$ sharpens the soft sigmoid projection, progressively approximating $\ell_0$ regularization.
  • Figure 3: Effect of different training tokens on perplexity and zero-shot mean accuracy on different models.
  • Figure 4: Dense-model sparsity patterns (LLaMA-7B, 20% sparsity). Left: layer-wise MLP channel sparsity. Right: learned head sparsity map.
  • Figure 5: Dense-model sparsity patterns (LLaMA-7B, 50% sparsity). Sparsity increases toward later layers, and head pruning becomes more selective under higher compression.
  • ...and 1 more figure
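
The annealing behaviour described in the Figure 2 caption can be sketched as follows. The exact surrogate used by DDP is not given in this excerpt, so the form `sigmoid(score / mu)` is an assumption, and `soft_gate`, `soft_count`, `hard_count`, and the example scores are hypothetical names and values. The sketch only shows that a temperature-scaled sigmoid, summed over components, approaches the discrete count of positive scores as $\mu$ shrinks.

```python
import math

# Illustrative only: sigmoid(score / mu) is an assumed surrogate, not the paper's formula.

def soft_gate(score, mu):
    """Temperature-scaled sigmoid, written in a numerically stable form."""
    z = score / mu
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_count(scores, mu):
    """Differentiable stand-in for the number of retained components."""
    return sum(soft_gate(s, mu) for s in scores)


def hard_count(scores):
    """Discrete count the surrogate is meant to approximate."""
    return sum(1 for s in scores if s > 0)


scores = [2.0, 0.5, -0.3, -1.5]  # hypothetical retention scores
for mu in (1.0, 0.1, 0.01):
    gap = abs(soft_count(scores, mu) - hard_count(scores))
    print(f"mu={mu}: soft count={soft_count(scores, mu):.6f}, gap={gap:.2e}")
```

The gap to the hard count shrinks as `mu` decreases, provided no score sits exactly at zero. A smaller `mu` also makes the gradient vanish away from the threshold, which is one common reason to anneal the temperature gradually rather than start small.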

Theorems & Definitions (9)

  • Theorem 3.1: Exact budget recovery (informal)
  • Lemma 2.1: Margin-separated relaxation approaches the discrete count
  • proof
  • Proposition 2.2: Binarization regularizer drives $\boldsymbol{s}$ toward $\{0,1\}$
  • proof
  • Theorem 2.3: Deterministic inexact ALM yields KKT accumulation points (STE surrogate)
  • proof
  • Theorem 2.4: DDP recovers the exact hard $P$-budget in the annealed limit
  • proof: Proof of Theorem \ref{thm:lp_hard_budget}