SequentialAttention++ for Block Sparsification: Differentiable Pruning Meets Combinatorial Optimization

Taisuke Yasuda, Kyriakos Axiotis, Gang Fu, MohammadHossein Bateni, Vahab Mirrokni

TL;DR

This work addresses the challenge of pruning neural networks in a structured, scalable way by unifying differentiable importance scoring with combinatorial optimization. It frames differentiable pruning as nonconvex regularization that aligns with group LASSO behavior, proving conditions for a unique sparse global minimum and linking to group-OMP/OMPR strategies. The authors introduce SequentialAttention++, a practical algorithm that combines differentiable softmax-based masking with an iterative sparsification–dense re-training loop inspired by ACDC, and provide theoretical guarantees for its nonconvex-regularization framework. Empirically, SequentialAttention++ achieves state-of-the-art block sparsification results on ImageNet and strong performance on Criteo, demonstrating practical benefits for large-scale structured pruning and hardware efficiency. The work advances both theory and practice in structured pruning, offering a cohesive view of how differentiable scoring and combinatorial search can jointly yield provably good sparse solutions.

Abstract

Neural network pruning is a key technique towards engineering large yet scalable, interpretable, and generalizable models. Prior work on the subject has developed largely along two orthogonal directions: (1) differentiable pruning for efficiently and accurately scoring the importance of parameters, and (2) combinatorial optimization for efficiently searching over the space of sparse models. We unite the two approaches, both theoretically and empirically, to produce a coherent framework for structured neural network pruning in which differentiable pruning guides combinatorial optimization algorithms to select the most important sparse set of parameters. Theoretically, we show how many existing differentiable pruning techniques can be understood as nonconvex regularization for group sparse optimization, and prove that for a wide class of nonconvex regularizers, the global optimum is unique, group-sparse, and provably yields an approximate solution to a sparse convex optimization problem. The resulting algorithm that we propose, SequentialAttention++, advances the state of the art in large-scale neural network block-wise pruning tasks on the ImageNet and Criteo datasets.
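As a rough illustration of how differentiable scores can guide a combinatorial selection step, the sketch below assigns one logit per weight block, turns the logits into softmax importance scores, keeps the top-k blocks, and zeroes out the rest. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the one-logit-per-block parameterization, the row-major block indexing, and the choice to scale kept blocks by their scores are all hypothetical, and the training loop with alternating sparse and dense phases is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_blocks(scores, k):
    """Return the indices of the k highest-scoring blocks, in ascending order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


def apply_block_mask(weights, block_size, scores, keep):
    """Scale kept blocks by their scores and zero out all other blocks.

    weights is a matrix (list of rows) whose dimensions are multiples of
    block_size; blocks are indexed in row-major order (an assumption).
    """
    rows, cols = len(weights), len(weights[0])
    blocks_per_row = cols // block_size
    keep = set(keep)
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            b = (r // block_size) * blocks_per_row + (c // block_size)
            if b in keep:
                out[r][c] = weights[r][c] * scores[b]
    return out


# Toy example: a 4x4 weight matrix split into four 2x2 blocks.
weights = [[1.0] * 4 for _ in range(4)]
logits = [2.0, 0.0, 1.0, -1.0]  # hypothetical per-block logits
scores = softmax(logits)
kept = top_k_blocks(scores, k=2)
masked = apply_block_mask(weights, 2, scores, kept)
```

In this toy setup the two blocks with the largest logits survive and every entry of the other two blocks becomes zero. In a real training run the logits would be learned jointly with the weights, so the scores would reflect the loss rather than fixed values.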

Paper Structure (31 sections, 8 theorems, 18 equations, 5 figures, 4 tables, 2 algorithms)

Key Result

Theorem 1.1

Let $q:\mathbb R_+\to\mathbb R_+$ be strictly increasing, subadditive (i.e., $q(a+b)\leq q(a) + q(b)$ for $a,b\in\mathbb R_+$), and satisfy $q(0) = 0$. If eq:group-lasso has a unique minimizer $\boldsymbol{\beta}^*$ with group sparsity at most $1$, then $\boldsymbol{\beta}^*$ is also the unique mini
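The three hypotheses on $q$ stated above (strictly increasing, subadditive, $q(0)=0$) can be spot-checked numerically for a candidate function. The sketch below uses $q(x)=\log(1+x)$, which satisfies all three because $\log(1+a+b)\leq\log((1+a)(1+b))$; choosing this particular $q$ is an assumption made for illustration, and the helper names are hypothetical. A grid check like this can only refute a property, not prove it.

```python
import math


def q(x):
    """Candidate regularizer q(x) = log(1 + x) (illustrative choice)."""
    return math.log1p(x)


def is_subadditive(f, grid, tol=1e-12):
    """Check f(a + b) <= f(a) + f(b) for every pair of grid points."""
    return all(f(a + b) <= f(a) + f(b) + tol for a in grid for b in grid)


def is_strictly_increasing(f, grid):
    """Check f(x) < f(y) for consecutive sorted grid points x < y."""
    pts = sorted(grid)
    return all(f(x) < f(y) for x, y in zip(pts, pts[1:]))


grid = [i * 0.25 for i in range(41)]  # points 0, 0.25, ..., 10

log_ok = q(0.0) == 0.0 and is_subadditive(q, grid) and is_strictly_increasing(q, grid)
square_ok = is_subadditive(lambda x: x * x, grid)  # x^2 fails, e.g. at a = b = 1
```

For contrast, $x\mapsto x^2$ is strictly increasing on $\mathbb R_+$ and vanishes at $0$, but it is not subadditive, so it falls outside the class of regularizers the theorem covers.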

Figures (5)

  • Figure 1: Differentiable pruning of weight blocks
  • Figure 2: (a) Softmax attention vs magnitude pruning, and (b) the Sparsification phase.
  • Figure 3: Training accuracy vs step on ImageNet: Comparison between ACDC and SequentialAttention++. The setting is $90\%$ sparsity and $32\times 32$-size blocks.
  • Figure 4: Block sparsification on ImageNet.
  • Figure 5: Block sparsification on Criteo. There are no Powerpropagation results for block size $1$ because the algorithm diverged.

Theorems & Definitions (17)

  • Theorem 1.1: Unique sparse global minima
  • Lemma 2.1: Unnormalized softmax as log-sum regularization
  • Lemma 2.2: $\ell_1$-regularized masks as $\ell_q$ regularization
  • Lemma 2.3: Group powerpropagation as Group LASSO
  • Theorem 2.3: Unique sparse global minima
  • Lemma 2.4
  • proof
  • Lemma 2.5
  • proof
  • proof: Proof of Lemma 2.1
  • ...and 7 more