Choose Your Model Size: Any Compression of Large Language Models Without Re-Computation
Martin Genzel, Patrick Putzky, Pengfei Zhao, Sebastian Schulze, Mattes Mollenhauer, Robert Seidel, Stefan Dietzel, Thomas Wollmann
TL;DR
ACIP (Any Compression via Iterative Pruning) is a post-training framework that decouples pruning from compression, so that a model can be materialized at any target size in real time. It reparametrizes dense linear layers with SVD, adds low-rank adapters, and learns a global score map through iterative $\ell_1$-regularized pruning. With that score map, singular values can be pruned to any desired compression level without re-calibration. The approach yields state-of-the-art results compared with factorization-based baselines across multiple open-weight LLMs and tasks, and it combines with quantization strategies to reduce memory footprints further. ACIP thus offers a practical, scalable tool for deploying large language models in resource-constrained settings, with potential for integration into existing compression pipelines.
Abstract
The adoption of Foundation Models in resource-constrained environments remains challenging due to their large size and inference costs. A promising way to overcome these limitations is post-training compression, which aims to balance reduced model size against performance degradation. This work presents Any Compression via Iterative Pruning (ACIP), a novel algorithmic approach to determine a compression-performance trade-off from a single stochastic gradient descent run. To achieve parameter efficiency, we use an SVD-reparametrization of linear layers and iteratively prune their singular values with a sparsity-inducing penalty. Importantly, the pruning order of the parameters is used to derive a global score map that allows compressing a model to any target size without re-computation. We evaluate ACIP on a large selection of open-weight LLMs and downstream tasks, demonstrating state-of-the-art results compared to existing factorization-based compression methods. We also show that ACIP seamlessly complements common quantization-based compression techniques.
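To make the idea of a global score map concrete, the following is a minimal NumPy sketch, not the authors' implementation. It factors each linear layer by SVD and, given one score per singular value, keeps the highest-scoring ones across all layers until a parameter budget is reached. The function names (`svd_reparametrize`, `select_by_global_scores`, `materialize`) are hypothetical, and the sketch assumes the scores are already available. In ACIP they are derived from the pruning order observed during the $\ell_1$-regularized optimization, which is not reproduced here, and the low-rank adapters are omitted as well.

```python
import numpy as np


def svd_reparametrize(weight):
    """Factor a dense weight W (out x in) as U @ diag(s) @ Vt."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u, s, vt


def select_by_global_scores(scores, shapes, target_ratio):
    """Return one boolean mask per layer marking the singular values to keep.

    scores:       list of 1-D arrays, one score per singular value (assumed given)
    shapes:       list of (out, in) shapes of the original dense layers
    target_ratio: fraction of the dense parameter count that may be kept
    """
    budget = target_ratio * sum(o * i for o, i in shapes)
    # Rank every singular value of every layer in one global ordering.
    entries = [
        (score, layer, idx)
        for layer, layer_scores in enumerate(scores)
        for idx, score in enumerate(layer_scores)
    ]
    entries.sort(key=lambda e: -e[0])

    masks = [np.zeros(len(s), dtype=bool) for s in scores]
    used = 0
    for _, layer, idx in entries:
        # Keeping one singular value costs one column of U and one row of Vt.
        cost = sum(shapes[layer])
        if used + cost > budget:
            break
        masks[layer][idx] = True
        used += cost
    return masks


def materialize(u, s, vt, mask):
    """Build the two low-rank factors for the kept singular values."""
    return u[:, mask] * s[mask], vt[mask, :]
```

Because the ordering is computed once, asking for a different `target_ratio` only changes where the loop stops. The masks for a smaller budget are therefore a subset of the masks for a larger one, which is what allows a model to be compressed to any target size without re-computation.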
