CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Vision Transformers

Boxiang Zhang, Baijian Yang

TL;DR

CORP tackles the challenge of deploying Vision Transformers under strict post-training constraints by introducing a closed-form, one-shot structured pruning method that preserves representations. It reframes pruning as a representation-recovery problem and derives affine and logit compensation that folds into weights without gradients or fine-tuning, enabling pruning of both MLP and attention while using only a small calibration set. The approach achieves strong accuracy preservation on ImageNet across DeiT sizes (e.g., 82.8% Top-1 on DeiT-Huge at 50% sparsity) and delivers real hardware speedups without retraining. By shifting focus from importance ranking to explicit representation compensation, CORP provides a scalable, deployment-friendly solution for compressing Vision Transformers.

Abstract

Vision Transformers achieve strong accuracy but incur high compute and memory cost. Structured pruning can reduce inference cost, but most methods rely on retraining or multi-stage optimization. These requirements limit post-training deployment. We propose CORP, a closed-form one-shot structured pruning framework for Vision Transformers. CORP removes entire MLP hidden dimensions and attention substructures without labels, gradients, or fine-tuning. It operates under strict post-training constraints using only a small unlabeled calibration set. CORP formulates structured pruning as a representation recovery problem. It models removed activations and attention logits as affine functions of retained components and derives closed-form ridge regression solutions that fold compensation into model weights. This minimizes expected representation error under the calibration distribution. Experiments on ImageNet with DeiT models show strong redundancy in MLP and attention representations. Without compensation, one-shot structured pruning causes severe accuracy degradation. With CORP, models preserve accuracy under aggressive sparsity. On DeiT-Huge, CORP retains 82.8% Top-1 accuracy after pruning 50% of both MLP and attention structures. CORP completes pruning in under 20 minutes on a single GPU and delivers substantial real-world efficiency gains.
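
To make the idea of affine compensation concrete, the following is a minimal NumPy sketch of the MLP case only. It is not the authors' implementation: the function name `fold_mlp_compensation`, the ridge strength `lam`, and the synthetic calibration data are all assumptions made for illustration. The sketch fits the pruned hidden activations as an affine function of the retained ones by ridge regression, then folds that map into the second linear layer's weight and bias, so no extra computation remains at inference time.

```python
import numpy as np

def fold_mlp_compensation(H, W2, b2, keep, prune, lam=1e-3):
    """Hypothetical helper: fold an affine ridge fit of pruned units into W2, b2.

    H  : (n, d) hidden activations collected on a calibration set
    W2 : (out, d) weight of the second linear layer, b2 : (out,) its bias
    keep, prune : index arrays partitioning the d hidden dimensions
    """
    Hk, Hp = H[:, keep], H[:, prune]
    mu_k, mu_p = Hk.mean(axis=0), Hp.mean(axis=0)
    Hk_c, Hp_c = Hk - mu_k, Hp - mu_p
    n = H.shape[0]
    C_kk = Hk_c.T @ Hk_c / n            # covariance of retained units
    C_kp = Hk_c.T @ Hp_c / n            # cross-covariance retained -> pruned
    # Closed-form ridge solution: h_prune ~= A @ h_keep + c
    A = np.linalg.solve(C_kk + lam * np.eye(len(keep)), C_kp).T
    c = mu_p - A @ mu_k
    # Fold the compensation into the second layer.
    W2_new = W2[:, keep] + W2[:, prune] @ A
    b2_new = b2 + W2[:, prune] @ c
    return W2_new, b2_new

# Synthetic calibration data (assumption): pruned units are nearly an
# affine function of the retained units, mimicking strong redundancy.
rng = np.random.default_rng(0)
n, d, out = 512, 8, 4
keep, prune = np.arange(0, 4), np.arange(4, 8)
Hk = rng.normal(size=(n, 4))
Hp = Hk @ rng.normal(size=(4, 4)) + 0.5 + 0.01 * rng.normal(size=(n, 4))
H = np.concatenate([Hk, Hp], axis=1)
W2, b2 = rng.normal(size=(out, d)), rng.normal(size=out)

Y_full = H @ W2.T + b2                       # unpruned layer output
Y_naive = H[:, keep] @ W2[:, keep].T + b2    # pruning without compensation
W2_new, b2_new = fold_mlp_compensation(H, W2, b2, keep, prune)
Y_comp = H[:, keep] @ W2_new.T + b2_new      # pruning with folded compensation

err_naive = np.mean((Y_full - Y_naive) ** 2)
err_comp = np.mean((Y_full - Y_comp) ** 2)
print(f"naive error: {err_naive:.4f}, compensated error: {err_comp:.6f}")
```

In a real model, `H` would hold the post-activation outputs of the first MLP layer on the calibration images, and the corresponding rows of the first layer would be dropped along with the pruned columns of `W2`. How CORP selects which dimensions to keep and how it compensates attention logits are not covered by this sketch.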

Paper Structure (41 sections, 36 equations, 3 figures, 4 tables, 5 algorithms)

Figures (3)

  • Figure 1: Illustration of structured pruning targets in Vision Transformers. (a) Attention head dimension pruning removes channel dimensions in the query and key projections without discarding entire heads. Clear regions indicate retained structures, while hatched regions denote pruned components. (b) MLP structured pruning removes entire hidden dimensions between the two linear layers.
  • Figure 2: Top-1 accuracy versus sparsity on DeiT-Base for MLP-only, Attention-only, and joint pruning. One-shot structured pruning without compensation leads to rapid accuracy degradation, while CORP consistently preserves accuracy across sparsity levels.
  • Figure 3: Top-1 accuracy comparison between activation-based and magnitude-based ranking with and without compensation. Results are shown for MLP pruning on DeiT-Small and DeiT-Base. Magnitude-based ranking performs slightly better at low sparsity with compensation, while activation-based ranking becomes more robust at higher sparsity.