Cut Less, Fold More: Model Compression through the Lens of Projection Geometry

Olga Saukh, Dong Wang, Haris Šikić, Yun Cheng, Lothar Thiele

TL;DR

This work formalizes structured pruning and model folding as orthogonal operators and shows that, within a rank distance of one, folding provably yields smaller parameter reconstruction error, and under mild smoothness assumptions, smaller functional perturbations than pruning.

Abstract

Compressing neural networks without retraining is vital for deployment at scale. We study calibration-free compression through the lens of projection geometry: structured pruning is an axis-aligned projection, whereas model folding performs a low-rank projection via weight clustering. We formalize both as orthogonal operators and show that, within a rank distance of one, folding provably yields smaller parameter reconstruction error, and under mild smoothness assumptions, smaller functional perturbations than pruning. At scale, we evaluate >1000 checkpoints spanning ResNet18, PreActResNet18, ViT-B/32, and CLIP ViT-B/32 on CIFAR-10 and ImageNet-1K, covering diverse training hyperparameters (optimizers, learning rates, augmentations, regularization, sharpness-aware training), as well as multiple LLaMA-family 60M and 130M parameter models trained on C4. We show that folding typically achieves higher post-compression accuracy, with the largest gains at moderate-high compression. The gap narrows and occasionally reverses at specific training setups. Our results position folding as a geometry-aware, calibration-free alternative to pruning that is often superior in practice and principled in theory.
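The two operators contrasted in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the rows of `W` are treated as the parameter vectors, pruning keeps the `k` rows with the largest L1 norm, and folding is modeled as the projector onto the span of cluster-indicator vectors (so each row is replaced by its cluster mean). The helper names `pruning_projector`, `kmeans_labels`, and `folding_projector` are hypothetical, and the plain Lloyd-style k-means stands in for whatever clustering the paper uses.

```python
import numpy as np

def pruning_projector(W, k):
    """Axis-aligned projector: keep the k rows with the largest L1 norm, zero the rest."""
    scores = np.abs(W).sum(axis=1)
    keep = np.argsort(-scores)[:k]
    C = np.zeros((W.shape[0], W.shape[0]))
    C[keep, keep] = 1.0  # diagonal 0/1 selection matrix
    return C

def kmeans_labels(W, k, iters=50, seed=0):
    """Plain Lloyd iterations over the rows of W (illustrative clustering only)."""
    rng = np.random.default_rng(seed)
    centers = W[rng.choice(W.shape[0], size=k, replace=False)]
    for _ in range(iters):
        dists = ((W[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = W[labels == j].mean(axis=0)
    return labels

def folding_projector(labels, k):
    """Projector onto the span of cluster indicators: rows become cluster means."""
    m = labels.shape[0]
    U = np.zeros((m, k))
    U[np.arange(m), labels] = 1.0  # one-hot cluster assignment
    return U @ np.linalg.pinv(U)   # orthogonal projector onto col(U)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))  # 8 parameter vectors of dimension 5
k = 4

C_p = pruning_projector(W, k)
C_f = folding_projector(kmeans_labels(W, k), k)

# Both operators are orthogonal projections: symmetric and idempotent.
for C in (C_p, C_f):
    assert np.allclose(C, C.T) and np.allclose(C @ C, C)

err_p = np.linalg.norm(W - C_p @ W) ** 2  # squared Frobenius error of pruning
err_f = np.linalg.norm(W - C_f @ W) ** 2  # squared Frobenius error of folding
print(f"pruning error: {err_p:.3f}, folding error: {err_f:.3f}")
```

The printed errors compare the two operators at the same rank on random data, which the theoretical result does not cover; the guarantee stated in the abstract concerns a folding whose rank is one higher than that of the pruning.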

Paper Structure (28 sections, 4 theorems, 18 equations, 28 figures, 10 tables)

This paper contains 28 sections, 4 theorems, 18 equations, 28 figures, 10 tables.

Key Result

Theorem 2.1

Given any pruning with basis $\mathbf{U}_p$ of rank $0 \leq k_p \leq m-1$ (i.e., at least one parameter vector is pruned), there exists a folding with basis $\mathbf{U}'_f$ and rank $k_f = k_p + 1$ such that $\|\mathbf{W} - \mathbf{W}'_f\|_F^2 \leq \|\mathbf{W} - \mathbf{W}_p\|_F^2$, where $\mathbf{W}_p = \mathbf{C}_p \mathbf{W}$ and $\mathbf{W}'_f = \mathbf{C}'_f \mathbf{W}$, with $\mathbf{C}_p$ and $\mathbf{C}'_f$ denoting the orthogonal projections defined in Eq. eq:pr
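One way to see why a single extra rank suffices is the following worked derivation. It is a sketch of one construction consistent with the statement, not necessarily the authors' proof, and it assumes that reconstruction error is measured in the squared Frobenius norm. The symbols $P$ (the index set of pruned rows), $\mathbf{w}_i$ (the $i$-th row of $\mathbf{W}$), and $\boldsymbol{\mu}$ (the mean of the pruned rows) are introduced here for illustration. The folding keeps every retained row as its own singleton cluster and merges all pruned rows into one additional cluster, which gives rank $k_p + 1$.

```latex
\begin{align*}
\|\mathbf{W} - \mathbf{W}_p\|_F^2
  &= \sum_{i \in P} \|\mathbf{w}_i\|^2,
  \qquad
  \boldsymbol{\mu} = \frac{1}{|P|} \sum_{i \in P} \mathbf{w}_i, \\
\|\mathbf{W} - \mathbf{W}'_f\|_F^2
  &= \sum_{i \in P} \|\mathbf{w}_i - \boldsymbol{\mu}\|^2
   = \sum_{i \in P} \|\mathbf{w}_i\|^2 - |P|\,\|\boldsymbol{\mu}\|^2 \\
  &\leq \sum_{i \in P} \|\mathbf{w}_i\|^2
   = \|\mathbf{W} - \mathbf{W}_p\|_F^2 .
\end{align*}
```

Under this construction, equality holds only when the pruned rows average to zero; otherwise the folded model is strictly closer to $\mathbf{W}$.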

Figures (28)

  • Figure 1: Folding outperforms magnitude pruning across diverse training regimes. Top row: ResNet18 and PreActResNet18 on CIFAR-10. ResNet18 checkpoints were trained from scratch with Adam using different hyperparameter configurations. PreActResNet18 checkpoints are from Andriushchenko et al. (2023). Bottom row: ViT-B/32 on CIFAR-10 from Andriushchenko et al. (2023) and CLIP ViT-B/32 on ImageNet-1K from Wortsman et al. (2022). See Appendix \ref{appx:hyperparameters} for details. In these plots, we use checkpoints that were trained without L1 regularization. Scatter plots show post-compression accuracy for magnitude pruning (L1 criterion) versus folding at uniform per-layer compression ratios (color-coded by layer-wise compression ratio). Bar plots depict the accuracy gain by folding, computed as $\Delta=\mathrm{Acc}(\text{FOLD})-\mathrm{Acc}(\text{MAG1})$, as a function of layer-wise compression ratio. Folding yields the largest improvements at moderate to high compression, confirming its robustness across architectures and datasets. Fig. \ref{fig:accuracy_FOLD_vs_MAG_L2} shows the results for magnitude pruning with L2 criterion.
  • Figure 2: MAG1 versus FOLD on ViTs after LayerNorm-only fine-tuning for ViT-B/32 on CIFAR-10 and CLIP ViT-B/32 on ImageNet-1K. In the scatter plots, points are checkpoints and color encodes layer-wise compression. Bar plots depict the accuracy gain $\Delta=\mathrm{Acc}(\text{FOLD})-\mathrm{Acc}(\text{MAG1})$, which remains positive and typically grows with compression, indicating that even under lightweight LayerNorm adaptation FOLD retains a consistent advantage over pruning.
  • Figure 3: Folded models retain their accuracy advantage after fine-tuning. Results for ResNet18 trained with Adam on CIFAR-10 (top row) and CLIP ViT-B/32 trained on ImageNet-1K (bottom row): (a,d) compare post-compression accuracy of magnitude pruning (MAG1) versus folding (FOLD) after 1 and 5 epochs of fine-tuning. (b,e) show the accuracy gap between folding and pruning as a function of fine-tuning epochs, demonstrating that folding maintains a consistent lead, i.e., the FOLD accuracy delta is positive. (c,f) illustrate accuracy trajectories before and after 5 epochs of fine-tuning for both methods, highlighting that folded models recover accuracy faster. Further results are in Appendix \ref{appx:further_results}.
  • Figure 4: Optimizer effect evaluated on ResNet18 checkpoints trained on CIFAR-10 with SGD (no L1 regularization). The figure complements Fig. \ref{fig:accuracy_FOLD_vs_MAG_L1}(a).
  • Figure 5: Learning rate modulates folding’s edge. Post-compression accuracy of FOLD and MAG1 across learning rates: ResNet18 with Adam (a) and SGD (b), PreActResNet18 (c), and ViT-B/32 (d). FOLD leads at moderate–low rates. With Adam, the gap shrinks or reverses at very high rates, and closes again at extremely small rates. SGD shows weaker or opposite dependence.
  • ...and 23 more figures

Theorems & Definitions (6)

  • Theorem 2.1
  • Theorem 2.2
  • Theorem 2.1
  • proof
  • Theorem 2.2
  • proof