Compression Laws for Large Language Models

Ayan Sengupta, Siddhant Chaudhary, Tanmoy Chakraborty

TL;DR

This work introduces compression laws to model how structured model pruning and recovery fine-tuning affect LLM performance after compression, extending scaling-law thinking to post-training efficiency. The authors propose a three-exponent power-law, $\mathcal{L} = \mathcal{L}_0^{\alpha}(1+r)^{\beta}\left(1 + \frac{1}{D+1}\right)^{\gamma}$, fit via ordinary least squares after log transformation, and examine a critical compression ratio that governs recoverability via recovery fine-tuning (RFT). Through over 1000 experiments on Qwen-2.5 and LLaMA-3 models across intrinsic and extrinsic tasks, they show intrinsic loss grows steeply with $r$ while extrinsic accuracy degrades more mildly, and that RFT can substantially mitigate loss, especially for larger models. Inference speedups can reach up to ~60% for the largest models at high compression, supporting substantial compute savings, though benefits diminish for smaller models; calibration-free methods generally yield stronger extrinsic recovery, while calibration-based methods offer more intrinsic stability. The findings yield practical guidelines for deploying compressed LLMs in resource-constrained settings and lay groundwork for adaptive, task-aware, and hybrid compression strategies.
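The fitting step mentioned above can be illustrated with a small sketch. Taking logarithms of the law gives a model that is linear in the three exponents, $\log \mathcal{L} = \alpha \log \mathcal{L}_0 + \beta \log(1+r) + \gamma \log\left(1 + \frac{1}{D+1}\right)$, so ordinary least squares applies directly. The code below is an illustration under stated assumptions, not the authors' implementation: the function name `fit_compression_law`, the synthetic data, and the exponent values are hypothetical, $r$ is taken to be the compression ratio, and $D$ is assumed to measure the amount of recovery fine-tuning data (its units are not specified here).

```python
import numpy as np

def fit_compression_law(L0, r, D, L):
    """Estimate (alpha, beta, gamma) by OLS on the log-transformed law.

    L0: base-model loss, r: compression ratio,
    D: assumed recovery fine-tuning data size, L: compressed-model loss.
    """
    L0, r, D, L = (np.asarray(x, dtype=float) for x in (L0, r, D, L))
    # One regressor per factor of the law; the law has no constant
    # factor, so no intercept column is added.
    X = np.column_stack([
        np.log(L0),
        np.log1p(r),                # log(1 + r)
        np.log1p(1.0 / (D + 1.0)),  # log(1 + 1/(D + 1))
    ])
    coef, *_ = np.linalg.lstsq(X, np.log(L), rcond=None)
    return tuple(coef)

# Synthetic, noise-free data generated from made-up exponents
# (illustrative values only, not results from the paper).
rng = np.random.default_rng(0)
true_alpha, true_beta, true_gamma = 0.9, 1.5, 0.4
L0 = rng.uniform(2.0, 4.0, size=200)
r = rng.uniform(0.1, 0.9, size=200)
D = rng.choice([0.0, 1.0, 10.0, 100.0], size=200)
L = L0**true_alpha * (1 + r)**true_beta * (1 + 1 / (D + 1))**true_gamma

alpha_hat, beta_hat, gamma_hat = fit_compression_law(L0, r, D, L)
print(alpha_hat, beta_hat, gamma_hat)
```

On noise-free synthetic data the fit should recover the generating exponents up to floating-point error. With real measurements the estimates would carry noise, and a check of the residuals in log space would be a sensible next step.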

Abstract

We introduce compression laws for large language models (LLMs). While recent scaling laws have sought to understand how LLMs scale with respect to model size, pre-training data, and computational resources, we focus on understanding how model compression affects the performance of a pre-trained LLM on downstream tasks. We empirically examine the effects of structured model compression on LLMs through over $1000$ experiments across eight models with sizes ranging from $0.5B$ to $14B$ parameters. Our findings indicate that the test cross-entropy loss increases quadratically with the compression ratio, whereas performance on downstream tasks declines only linearly. Our study emphasizes the importance of recovery fine-tuning in reducing generation loss, showing that the test loss of compressed LLMs can improve by up to 55% with recovery fine-tuning. At higher compression ratios (up to 90%), compressed LLMs demonstrate a speed increase of 60% during inference compared to their uncompressed counterparts, compensating for the performance degradation at this level. However, for smaller models ($\le 7B$), the computational gains are limited, peaking at just 35%. We conclude that model compression can be highly beneficial for larger models, especially when a smaller model within the same computational budget is not available. These insights provide practical guidelines for using model compression techniques to adopt LLMs in real-life applications in resource-constrained settings.

Paper Structure

This paper contains 26 sections, 2 theorems, 11 equations, 12 figures, 5 tables.

Key Result

Theorem 3.1

Consider the compression law $\mathcal{L} = \mathcal{L}_0^\alpha(1 + r)^\beta\left(1 + \frac{1}{D + 1}\right)^\gamma$ for a model class, where $\mathcal{L}$ and $\mathcal{L}_0$ represent the accuracy of the compressed and the base models, respectively. Further, assume that the scaling law satisfies

Figures (12)

  • Figure 1: Zero-shot accuracy of compressed Qwen and LLaMA models without (left) and with (right) recovery fine-tuning for calibration-free model compression (see Figure \ref{fig:extrinsic_slicegpt} in Appendix \ref{appx:intrinsic_extrinsic_results} for calibration results) on different extrinsic tasks.
  • Figure 2: Inference speedup of compressed Qwen-14B and LLaMA-8B (the two largest models used in the study) compared to the corresponding uncompressed models. At higher compression ratios, extrinsic performance declines significantly (over 40%) for large models ($>$7B parameters). However, the inference speedup compensates for this performance drop.
  • Figure 3: Test loss (intrinsic evaluation) with compressed LLMs (calibration-free) without (left) and with (right) recovery fine-tuning (with-calibration results are shown in Figure \ref{fig:intrinsic_results_slicegpt} of Appendix \ref{appx:intrinsic_extrinsic_results}).
  • Figure 4: Fit of intrinsic (a) and extrinsic (b) compression laws for different LLMs at different compression ratios using the calibration-free method. Different lines indicate different $\mathcal{L}_0$ frontiers. Impact of recovery fine-tuning on the intrinsic (c) and extrinsic (d) performance of compressed LLMs using the calibration-free method. Figure \ref{fig:main_fit_calibration} of Appendix \ref{appx:calibration_results} shows the compression laws for the calibration-based compression method.
  • Figure 5: Critical compression ratio for different model sizes for intrinsic (a) and extrinsic (b) performance. A high critical compression ratio indicates that an LLM can retain performance even when heavily compressed.
  • ...and 7 more figures

Theorems & Definitions (4)

  • Theorem 3.1
  • Corollary 3.2
  • proof
  • proof