OSSCAR: One-Shot Structured Pruning in Vision and Language Models with Combinatorial Optimization

Xiang Meng, Shibal Ibrahim, Kayhan Behdin, Hussein Hazimeh, Natalia Ponomareva, Rahul Mazumder

TL;DR

OSSCAR addresses one-shot structured pruning for very large vision and language models by recasting the pruning task as a scalable combinatorial optimization problem built on a layer-wise reconstruction objective. It introduces an MIQP-based framework whose reformulation groups weights to collapse the enormous weight space into a tractable form, paired with a local search algorithm that uses low-rank updates to explore the pruning space efficiently. The approach yields substantial practical gains: improved accuracy for CNNs such as ResNet50 at ~2x speedups, and large perplexity reductions for OPT models (e.g., up to 125x lower perplexity at a 2x inference speedup) with 6–8x faster pruning times, extending applicability to models with tens of billions of parameters. These results demonstrate OSSCAR's scalability and its potential to enable effective post-training pruning on models far larger than those previously addressed, with open-source code to foster adoption.

Abstract

Structured pruning is a promising approach for reducing the inference costs of large vision and language models. By removing carefully chosen structures, e.g., neurons or attention heads, the improvements from this approach can be realized on standard deep learning hardware. In this work, we focus on structured pruning in the one-shot (post-training) setting, which does not require model retraining after pruning. We propose a novel combinatorial optimization framework for this problem, based on a layer-wise reconstruction objective and a careful reformulation that allows for scalable optimization. Moreover, we design a new local combinatorial optimization algorithm, which exploits low-rank updates for efficient local search. Our framework is time- and memory-efficient and considerably improves upon state-of-the-art one-shot methods on vision models (e.g., ResNet50, MobileNet) and language models (e.g., OPT-1.3B -- OPT-30B). For language models, e.g., OPT-2.7B, OSSCAR can lead to $125\times$ lower test perplexity on WikiText with $2\times$ inference time speedup in comparison to the state-of-the-art ZipLM approach. Our framework is also $6\times$ -- $8\times$ faster. Notably, our work considers models with tens of billions of parameters, which is up to $100\times$ larger than what has been previously considered in the structured pruning literature.
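The layer-wise reconstruction idea mentioned in the abstract can be illustrated with a small NumPy sketch. It assumes a single linear layer with inputs `X` and dense weights `W`, a fixed set `keep` of retained input features, and a plain least-squares refit; the function names and toy sizes are hypothetical, and this is not the authors' implementation or their search procedure.

```python
import numpy as np

def reconstruct_after_pruning(X, W, keep):
    """Refit weights on the kept input features so that X @ W_new best
    matches the dense output X @ W in the least-squares sense."""
    H = X.T @ X                      # (d1, d1) Gram matrix of the layer inputs
    G = H @ W                        # (d1, d2) right-hand side X^T (X W)
    keep = np.asarray(keep)
    W_new = np.zeros_like(W)         # pruned rows stay exactly zero
    # Normal equations restricted to the kept rows: H[keep, keep] W_keep = G[keep]
    W_new[keep] = np.linalg.solve(H[np.ix_(keep, keep)], G[keep])
    return W_new

def reconstruction_error(X, W, W_new):
    """Squared Frobenius norm of the change in layer output."""
    return float(np.linalg.norm(X @ W - X @ W_new) ** 2)

# Toy data (hypothetical sizes): 256 samples, 8 input features, 4 outputs.
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))
W = rng.standard_normal((8, 4))
keep = [0, 1, 2, 4, 6, 7]            # features 3 and 5 are pruned

W_masked = np.zeros_like(W)
W_masked[keep] = W[keep]             # naive pruning: zero rows, no refit
W_refit = reconstruct_after_pruning(X, W, keep)

err_masked = reconstruction_error(X, W, W_masked)
err_refit = reconstruction_error(X, W, W_refit)
print(err_masked, err_refit)
```

Refitting the surviving weights can never do worse than simply zeroing the pruned rows, since the zeroed matrix is itself a feasible point of the same least-squares problem.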

Paper Structure (24 sections, 3 theorems, 32 equations, 6 figures, 11 tables, 1 algorithm)

Key Result

Proposition 4.1

Consider two sets $S$ and $S'$, and suppose we have already computed the value of $f(S')$, the inverse of $H_{I_{S'},I_{S'}}$, and the optimal weight matrix for $f(S')$. Then the value of $f(S)$, the inverse of $H_{I_{S},I_{S}}$, and the optimal weight matrix for $f(S)$ can be computed in $O(td_1(d_1+d_2))$ time.
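One ingredient that makes this kind of update cheap is a standard Schur-complement identity: the inverse of a principal submatrix can be obtained from the inverse of the full matrix without refactorizing. The sketch below is a generic NumPy illustration of that identity for a symmetric positive definite matrix. The function name is hypothetical, `t` here is simply the number of removed indices (this excerpt does not define the paper's $t$), and the code is not the authors' algorithm.

```python
import numpy as np

def inverse_after_removal(H_inv, remove):
    """Given H_inv = inv(H) for symmetric H, return the inverse of H with the
    rows/columns in `remove` deleted, together with the kept indices.

    Uses inv(H[K, K]) = B[K, K] - B[K, R] inv(B[R, R]) B[R, K] with B = inv(H).
    With t = len(remove) and d = H.shape[0], the dominant cost is O(t d^2).
    """
    d = H_inv.shape[0]
    remove = np.asarray(remove)
    keep = np.setdiff1d(np.arange(d), remove)
    B_kk = H_inv[np.ix_(keep, keep)]
    B_kr = H_inv[np.ix_(keep, remove)]
    B_rr = H_inv[np.ix_(remove, remove)]
    # B[R, K] equals B[K, R].T because H (and hence its inverse) is symmetric.
    return B_kk - B_kr @ np.linalg.solve(B_rr, B_kr.T), keep

# Toy check on a random symmetric positive definite matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
H = A.T @ A + np.eye(10)
H_inv = np.linalg.inv(H)

updated, keep = inverse_after_removal(H_inv, [2, 7])
direct = np.linalg.inv(H[np.ix_(keep, keep)])
max_abs_diff = float(np.max(np.abs(updated - direct)))
print(max_abs_diff)
```

The update only solves a small `t`-by-`t` system and forms one low-rank correction, which is why swapping a few indices in or out of the current set is much cheaper than inverting from scratch.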

Figures (6)

  • Figure 1: Comparison of two methods for representing the weight. Top: Traditional vanilla formulation representing the weight as a single vector of size $C_{in} \times k_H \times k_W \times C_{out}$, potentially consisting of hundreds of millions of variables. Left: Our proposed method, representing the weight as a $(C_{in} \times k_H \times k_W) \times C_{out}$ matrix. Here, each column corresponds to the weights of a single convolutional filter and shares the same quadratic coefficient. Therefore, this representation significantly reduces the problem's scale. In both figures, cells with different colors represent weights acting on different input channels, with each channel being acted on by $(C_{out} \times k_H \times k_W)$ weights.
  • Figure 2: Illustration of structured pruning in a convolutional layer to reduce feature map $X$'s width by pruning weights that act on certain input channels. This figure shows an example where weights acting on the second input channel (denoted by gray cells) are pruned. Consequently, the corresponding channel (also in gray) in the feature map $X$ becomes redundant and can be removed.
  • Figure 3: Illustration of structured pruning to accelerate a Transformer layer. The multi-head attention comprises $C_h$ distinct single-head attention blocks. The outputs from these blocks are concatenated and processed through a linear sublayer. We prune the linear sublayer for multi-head integration and the second sublayer of the feed-forward network, as marked in red.
  • Figure 4: The structure of sets $\{S_i\}_{i=1}^T$ under different choices of $\hat{p}$. Left: with a large $\hat{p}$, Algorithm 1 mimics a greedy pruning approach, incrementally expanding the set $S$ and resulting in nested sets $S_1\subset S_2\subset S_3\cdots$. Right: with a small $\hat{p}$, Algorithm 1 employs a local swapping strategy, leading to sets without a nested structure.
  • Figure 5: Perplexity performance on WikiText (in log-scale) for one-shot structured pruning of OPT models (6.7B, 13B, and 30B). The speedup ratio denotes the inference time improvement of pruned models over dense models. For all methods, we take ten runs and report the mean perplexity.
  • ...and 1 more figures
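The matrix representation described in the Figure 1 caption, together with the channel pruning of Figure 2, can be sketched in NumPy as follows. The tensor layout `(C_out, C_in, k_H, k_W)`, the toy sizes, and the helper `channel_rows` are assumptions made for illustration only and are not taken from the paper's code.

```python
import numpy as np

# Hypothetical toy sizes; the weight layout (C_out, C_in, k_H, k_W) is an assumption.
C_out, C_in, k_H, k_W = 4, 3, 2, 2
rng = np.random.default_rng(0)
weight = rng.standard_normal((C_out, C_in, k_H, k_W))

# Matrix form: one column per convolutional filter, with rows indexed by
# (input channel, kernel row, kernel column).
W_mat = weight.reshape(C_out, C_in * k_H * k_W).T   # shape (C_in*k_H*k_W, C_out)

def channel_rows(c, k_H, k_W):
    """Row indices of the matrix form that act on input channel c."""
    start = c * k_H * k_W
    return np.arange(start, start + k_H * k_W)

# Structured pruning of input channel 1: zero its whole group of rows.
pruned = W_mat.copy()
pruned[channel_rows(1, k_H, k_W)] = 0.0

# Map back to the 4-D tensor to confirm the whole channel slice is removed.
pruned_tensor = pruned.T.reshape(C_out, C_in, k_H, k_W)
num_zeroed = int((pruned_tensor == 0.0).sum())       # C_out * k_H * k_W weights
print(W_mat.shape, num_zeroed)
```

In this layout each input channel corresponds to a contiguous block of `k_H * k_W` rows, so a pruning decision is made once per channel instead of once per individual weight.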

Theorems & Definitions (3)

  • Proposition 4.1
  • Proposition 4.2
  • Lemma 1.1