ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking

Wenshuo Li, Xinghao Chen, Han Shu, Yehui Tang, Yunhe Wang

TL;DR

ExCP tackles the heavy storage burden of training checkpoints for large language models by integrating residual checkpoint encoding, joint pruning of weights and optimizer momentum, and non-uniform quantization. It defines a residual framework ${\Delta \mathcal{P}_t = \{\Delta \mathcal{W}_t, \mathcal{O}_t\} = \{\mathcal{W}_t-\mathcal{W}_{t-1}, \mathcal{O}_t\}}$ and employs thresholds $r_w$ and $r_o$ to prune weights and momentum with convergence guarantees in the Adam setting. Quantization then clusters values into $2^n-1$ centers (plus a zero center), storing centers $\mathcal{C}_t$ and indices $\mathcal{I}_t$, enabling highly compact checkpoints with near-lossless resume capability. Experiments across ViT-L32, Pythia-410M, and PanGu-$\pi$ models demonstrate compression factors from ~25x to ~70x with negligible downstream performance loss, suggesting substantial practical impact for reducing storage costs in LLM training pipelines. The work also provides ablation evidence that residuals, joint pruning, and quantization are all essential components for attaining the best trade-offs.
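The residual and joint-pruning steps can be illustrated with a minimal NumPy sketch. This summary does not give the exact definitions of $r_w$ and $r_o$, so the sketch substitutes plain magnitude quantiles for them. The rule that a momentum entry survives only where the corresponding weight residual also survives is likewise an assumption about what "joint" means here. The function names `joint_prune` and `restore_weights` and the parameters `keep_w` and `keep_o` are hypothetical.

```python
import numpy as np

def joint_prune(w_prev, w_curr, momentum, keep_w=0.1, keep_o=0.1):
    """Prune the weight residual and the optimizer momentum together.

    keep_w / keep_o are illustrative keep-fractions. They stand in for the
    thresholds r_w and r_o, whose exact definitions are not reproduced here.
    """
    # Residual between adjacent checkpoints: Delta W_t = W_t - W_{t-1}
    delta_w = w_curr - w_prev

    # Assumed threshold r_w: a magnitude quantile of the residual
    r_w = np.quantile(np.abs(delta_w), 1.0 - keep_w)
    mask_w = np.abs(delta_w) > r_w

    # Assumed threshold r_o: a magnitude quantile of the momentum
    r_o = np.quantile(np.abs(momentum), 1.0 - keep_o)
    # Assumed joint rule: keep momentum only where the residual is also kept
    mask_o = (np.abs(momentum) > r_o) & mask_w

    return delta_w * mask_w, momentum * mask_o

def restore_weights(w_prev, pruned_delta):
    """Rebuild an approximation of W_t from W_{t-1} and the sparse residual."""
    return w_prev + pruned_delta
```

Because the residual is stored instead of the full weights, entries that were pruned simply keep their value from the previous checkpoint when the checkpoint is restored.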

Abstract

Large language models (LLMs) have recently attracted significant attention in the field of artificial intelligence. However, training these models poses significant challenges in terms of computational and storage capacity, so compressing checkpoints has become an urgent problem. In this paper, we propose a novel Extreme Checkpoint Compression (ExCP) framework, which significantly reduces the storage required for training checkpoints while achieving nearly lossless performance. We first calculate the residuals of adjacent checkpoints to obtain the essential but sparse information needed for a higher compression ratio. To further exploit redundant parameters in checkpoints, we then propose a weight-momentum joint shrinking method that uses another important source of information from model optimization, namely momentum. In particular, we exploit the information of both model and optimizer to discard as many parameters as possible while preserving the critical information needed to maintain performance. Furthermore, we utilize non-uniform quantization to further compress the storage of checkpoints. We extensively evaluate our proposed ExCP framework on several models ranging from 410M to 7B parameters and demonstrate significant storage reduction while maintaining strong performance. For instance, we achieve approximately $70\times$ compression for the Pythia-410M model, with the final performance being as accurate as the original model on various downstream tasks. Code will be available at https://github.com/Gaffey/ExCP.
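The non-uniform quantization step can be sketched as a small 1-D k-means over the nonzero entries of a pruned tensor, with index 0 reserved for the zero center so that $2^n - 1$ learned centers plus zero fit in $n$ bits. The choice of k-means, the quantile initialization, the iteration count, and the names `quantize_nonuniform` and `dequantize` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def quantize_nonuniform(x, n_bits=4, iters=20):
    """Cluster nonzero entries of x into 2**n_bits - 1 centers (n_bits <= 8).

    Returns (centers, indices): centers[0] is the reserved zero center and
    indices has the same shape as x.
    """
    flat = x.ravel()
    nonzero = flat != 0
    vals = flat[nonzero]
    k = 2 ** n_bits - 1

    centers = np.zeros(k + 1, dtype=np.float64)      # centers[0] stays 0
    indices = np.zeros(flat.shape, dtype=np.uint8)   # 0 means "pruned"
    if vals.size == 0:
        return centers, indices.reshape(x.shape)

    # Assumed initialization: spread the centers over the value quantiles
    c = np.quantile(vals, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Assign each value to its nearest center, then recompute the means
        assign = np.argmin(np.abs(vals[:, None] - c[None, :]), axis=1)
        for j in range(k):
            members = vals[assign == j]
            if members.size:
                c[j] = members.mean()

    # Final assignment against the converged centers
    assign = np.argmin(np.abs(vals[:, None] - c[None, :]), axis=1)
    centers[1:] = c
    indices[nonzero] = (assign + 1).astype(np.uint8)
    return centers, indices.reshape(x.shape)

def dequantize(centers, indices):
    """Look up each stored index in the center table."""
    return centers[indices]
```

Only the small center table and the per-entry indices need to be stored. Since most indices are zero after pruning, the index array is itself highly compressible.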

Paper Structure (17 sections, 2 theorems, 17 equations, 10 figures, 7 tables, 3 algorithms)

Key Result

Theorem 3.1

According to the convergence analysis of Adam (Kingma and Ba, 2014), assume that the function $f_t$ has bounded gradients, $\left\|\nabla f_t(\theta)\right\|_2 \leq G$, $\left\|\nabla f_t(\theta)\right\|_{\infty} \leq G_{\infty}$ for all $\theta \in \mathbb{R}^d$, and the distance between any $\theta_t$ generated by Adam is

Figures (10)

  • Figure 1: The number of parameters of some LLMs and the general training process of LLMs. (a) Parameters of some recent LLMs; most contain billions of weights, and the trend is toward larger models. (b) The training of LLMs consists of several stages with a variety of schemes and data. A large number of checkpoints are stored in each stage. Given the size of LLMs, training requires extremely high-capacity storage, which could cost tens of millions of dollars.
  • Figure 2: Framework of our proposed compression process. We calculate the residual $\Delta \mathcal{W}_t$ and apply joint-pruning on $\Delta \mathcal{W}_t$ and $\mathcal{O}_t$. Then we quantize them separately and save the cluster center $\mathcal{C}_t$ and cluster index $\mathcal{I}_t$.
  • Figure 3: Weight distributions for the original weights, pruning on residual checkpoints, and pruning on the original weights. For clarity, we plot the histogram of 100k randomly sampled non-zero weights in each case. The bin range is bounded by (mean - 3 * std, mean + 3 * std) and 256 bins are used.
  • Figure 4: Comparison of training loss and checkpoint size between original models and our methods.
  • Figure 5: Q&A example to show the difference between our compressed model and the original model.
  • ...and 5 more figures

Theorems & Definitions (3)

  • Theorem 3.1
  • Theorem 1.1
  • proof