Choose Your Model Size: Any Compression of Large Language Models Without Re-Computation

Martin Genzel, Patrick Putzky, Pengfei Zhao, Sebastian Schulze, Mattes Mollenhauer, Robert Seidel, Stefan Dietzel, Thomas Wollmann

TL;DR

ACIP introduces Any Compression via Iterative Pruning, a post-training framework that decouples pruning from compression to enable real-time materialization of models at any target size. By reparametrizing dense linear layers with SVD, adding low-rank adapters, and learning a global score map through iterative $\ell_1$-regularized pruning, ACIP can flexibly prune singular values to any desired compression level without re-calibration. The approach yields state-of-the-art results against factorization-based baselines across multiple open-weight LLMs and tasks, and it synergizes with quantization strategies to further reduce memory footprints. ACIP thus offers a practical, scalable tool for deploying large language models in resource-constrained settings, with broad potential for integration with existing compression pipelines.

Abstract

The adoption of Foundation Models in resource-constrained environments remains challenging due to their large size and inference costs. A promising way to overcome these limitations is post-training compression, which aims to balance reduced model size against performance degradation. This work presents Any Compression via Iterative Pruning (ACIP), a novel algorithmic approach to determine a compression-performance trade-off from a single stochastic gradient descent run. To achieve parameter efficiency, we use an SVD-reparametrization of linear layers and iteratively prune their singular values with a sparsity-inducing penalty. Importantly, the pruning order of the parameters is used to derive a global score map that allows compressing a model to any target size without re-computation. We evaluate ACIP on a large selection of open-weight LLMs and downstream tasks, demonstrating state-of-the-art results compared to existing factorization-based compression methods. We also show that ACIP seamlessly complements common quantization-based compression techniques.
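As a rough illustration of how a global score map can yield a model of any target size without re-computation, the following Python sketch ranks singular values across all layers by score and keeps the top-ranked ones until a parameter budget is used up. The function name `select_singular_values`, the layer names, and the cost model (each retained singular value of an m × n layer is assumed to cost m + n parameters) are hypothetical choices made for this example and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def select_singular_values(
    scores: Dict[str, List[float]],
    shapes: Dict[str, Tuple[int, int]],
    target_ratio: float,
) -> Dict[str, List[int]]:
    """Pick singular-value indices per layer for a target size ratio.

    scores: per-layer importance scores (higher means more important).
    shapes: per-layer (m, n) shape of the original dense weight.
    target_ratio: desired size relative to the dense parameter count.
    """
    dense_params = sum(m * n for m, n in shapes.values())
    budget = target_ratio * dense_params

    # Rank every singular value of every layer in one global list.
    ranked = sorted(
        (
            (score, name, idx)
            for name, layer_scores in scores.items()
            for idx, score in enumerate(layer_scores)
        ),
        key=lambda item: item[0],
        reverse=True,
    )

    kept: Dict[str, List[int]] = {name: [] for name in scores}
    used = 0
    for _, name, idx in ranked:
        m, n = shapes[name]
        cost = m + n  # ASSUMPTION: one column of U plus one row of V^T
        if used + cost > budget:
            break  # stop at the first component that no longer fits
        kept[name].append(idx)
        used += cost
    return kept


if __name__ == "__main__":
    # Toy score map for two hypothetical layers "a" and "b".
    scores = {"a": [3.0, 1.0, -2.0], "b": [2.0, -1.0]}
    shapes = {"a": (4, 4), "b": (2, 2)}
    for ratio in (0.5, 0.75):
        print(ratio, select_singular_values(scores, shapes, ratio))
```

Because the selection is always a prefix of one fixed global ranking, changing the target ratio only moves the cut-off point. No scores are recomputed, and the components kept at a smaller size are a subset of those kept at a larger size.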

Paper Structure

This paper contains 49 sections, 10 equations, 18 figures, 6 tables, 1 algorithm.

Figures (18)

  • Figure 1: Compared to conventional compression algorithms (a), an Any Compression algorithm (b) swaps the computational calibration step and the decision step, so that models of different target sizes can be materialized without re-computation.
  • Figure 1: The procedure for updating the score map after each optimization step. For unpruned mask parameters $\mathbf{p}$, the score is their current magnitude, which estimates the future pruning order. Once a parameter is pruned, its score becomes a negative integer value that tracks the pruning history by decrementing at each step. This dual mechanism establishes a global importance ranking used to compress the model to any target size in Step 3.
  • Figure 2: A visual overview of ACIP. The linear layers of the base model are reparametrized in terms of their singular value decomposition $\mathbf{U} \mathbf{M} \boldsymbol\Sigma \mathbf{V}^\top$, with a (binary) singular value mask $\mathbf{M} = \mathbf{M}(\mathbf{p})$ and a low-rank adapter $\boldsymbol\Delta$. An objective function is optimized via gradient descent over the mask parameters $\mathbf{p}$ and adapters $\boldsymbol\Delta$, where sparsity is induced on $\mathbf{p}$ by an increasing $\ell_1$-penalty. This leads to pruned entries in the mask $\mathbf{M}(\mathbf{p})$. The optimization path of $\mathbf{p}$ gives rise to a score map that determines the global importance of the singular values across the full model. Potential compression errors are compensated by $\boldsymbol\Delta$. Based on the parameter scores, the base model can be flexibly compressed to any target size by masking the entries of $\boldsymbol\Sigma$. The learned adapters $\boldsymbol\Delta$ are used as a correction at any compression level.
  • Figure 3: Progressive shrinkage of exemplary mask parameters $\mathbf{p}$ in Attn-V layer $l=30$ of LLaMA-7B based on the shrinkage equation. Each plotted line corresponds to the evolution of a parameter value over training time. The starting points of shrinkage are predictive of the pruning order, a typical phenomenon in $\ell_1$-regularization. In ACIP, this pruning order determines the score of associated singular values $\mathbf{s}$ (cf. Algorithm 1).
  • Figure 4: Compression-performance trade-offs generated by ACIP on C4. Each curve was obtained by the Any Compression stage (Step 3), i.e., no additional computation was required except for a perplexity evaluation. Square marks denote the base model performance.
  • ...and 13 more figures
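
The score-map update described in the caption above (scores equal the current magnitude for unpruned mask parameters and become decrementing negative integers once pruned) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the names `update_scores` and `soft_threshold` are hypothetical, a freshly pruned parameter is assumed to start at a score of -1, and plain soft-thresholding is used as a stand-in for the shrinkage step, since the paper's exact update is not reproduced here.

```python
from typing import List


def soft_threshold(params: List[float], step: float) -> List[float]:
    """ASSUMPTION: stand-in l1 shrinkage; moves each value toward 0 by `step`."""
    shrunk = []
    for p in params:
        magnitude = max(abs(p) - step, 0.0)
        shrunk.append(magnitude if p >= 0 else -magnitude)
    return shrunk


def update_scores(scores: List[float], params: List[float]) -> List[float]:
    """Update the score map after one optimization step."""
    updated = []
    for score, p in zip(scores, params):
        if p != 0.0:
            # Unpruned: the current magnitude estimates the future pruning order.
            updated.append(abs(p))
        elif score >= 0:
            # Pruned in this step: start the negative counter (assumed at -1).
            updated.append(-1)
        else:
            # Pruned earlier: decrement, so earlier pruning ranks lower.
            updated.append(score - 1)
    return updated


if __name__ == "__main__":
    params = [3.0, 1.0, 2.0]
    scores = [abs(p) for p in params]
    for _ in range(3):
        params = soft_threshold(params, step=1.0)
        scores = update_scores(scores, params)
    print(scores)
```

In this toy run the parameter that is pruned first ends up with the most negative score, so sorting by score in descending order recovers the reverse pruning order as a single importance ranking.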

Theorems & Definitions (3)

  • Remark 2.1: Scaling of $\lambda$
  • Remark 2.2: Role of the Calibration Loss
  • Remark 2.3: Merging Low-Rank Adapters