Seeing Delta Parameters as JPEG Images: Data-Free Delta Compression with Discrete Cosine Transform

Chenyu Huang, Peng Ye, Xiaohui Wang, Shenghe Zheng, Biqing Qi, Lei Bai, Wanli Ouyang, Tao Chen

TL;DR

This work tackles the storage burden of task-specific finetuned models by proposing Delta-DCT, a data-free delta compression method that operates in the Discrete Cosine Transform (DCT) domain. It groups delta parameters into patches, ranks patch importance via the $L_2$ norm, assigns mixed-precision bit-widths, and quantizes in the DCT domain before reconstructing with IDCT, all without data or training. Across diverse models—LLMs from $7$B to $13$B, smaller language models, vision transformers, and multi-modal BEiT-3—the method achieves performance comparable to or better than finetuned models at a $1$-bit-equivalent compression ratio, outperforming prior data-dependent baselines such as BitDelta and Delta-CoMe. The results demonstrate a practical, scalable approach for on-device delta compression, with a modest storage overhead and high potential for parallelization to reduce compute time.

Abstract

With transformer-based models and the pretrain-finetune paradigm becoming mainstream, the high storage and deployment costs of keeping a separate finetuned model for each task pose critical challenges. Delta compression attempts to lower these costs by reducing the redundancy of delta parameters (i.e., the difference between the finetuned and pre-trained model weights). However, existing methods usually depend on access to data or on additional training. To tackle this issue, we introduce Delta-DCT, the first data-free delta compression method inspired by classic JPEG image compression, leveraging the Discrete Cosine Transform (DCT). We first (a) group the delta parameters within a layer into patches. Then we (b) assess the importance of each patch and allocate a quantization bit-width to it accordingly. Afterwards, we (c) convert the patches to the DCT domain and quantize each patch at its allocated bit-width. The proposed Delta-DCT does not require any training or data calibration, while achieving performance comparable to or even surpassing the original finetuned models under a 1-bit equivalent delta compression ratio on different kinds of models, including: (1) recently released LLMs of different sizes from 7B to 13B, (2) relatively smaller language models including RoBERTa and T5 models, (3) variants of vision transformer models, and (4) multi-modal BEiT-3 models.
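Steps (a) to (c) and the IDCT reconstruction can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the bit-width choices `(4, 2, 0)`, the patch fractions `(1/8, 1/4, 5/8)` (picked only so that the average comes to 1 bit per parameter), the min-max uniform quantizer, and all function names are assumptions made for illustration. The rescaling step mentioned in the paper's figure caption is omitted because this section does not specify it.

```python
import numpy as np

def dct_matrix(p):
    """Orthonormal DCT-II basis matrix C of shape (p, p), so C @ C.T == I."""
    k = np.arange(p)[:, None]          # frequency index
    n = np.arange(p)[None, :]          # sample index
    C = np.sqrt(2.0 / p) * np.cos(np.pi * (2 * n + 1) * k / (2 * p))
    C[0, :] = np.sqrt(1.0 / p)
    return C

def to_patches(delta, p):
    """Split an (H, W) matrix into p x p patches; H and W must be divisible by p."""
    H, W = delta.shape
    return delta.reshape(H // p, p, W // p, p).transpose(0, 2, 1, 3).reshape(-1, p, p)

def from_patches(patches, shape, p):
    """Inverse of to_patches."""
    H, W = shape
    return patches.reshape(H // p, W // p, p, p).transpose(0, 2, 1, 3).reshape(H, W)

def allocate_bits(patches, bit_choices=(4, 2, 0), fractions=(0.125, 0.25, 0.625)):
    """Rank patches by L2 norm; larger-norm patches get more bits (hypothetical split)."""
    norms = np.linalg.norm(patches.reshape(len(patches), -1), axis=1)
    order = np.argsort(-norms)         # most important patch first
    bits = np.zeros(len(patches), dtype=int)
    start = 0
    for b, f in zip(bit_choices, fractions):
        end = start + int(round(f * len(patches)))
        bits[order[start:end]] = b
        start = end
    return bits

def quantize(coeffs, bits):
    """Min-max uniform quantization to 2**bits levels (assumed quantizer)."""
    if bits == 0:
        return None                    # patch is dropped entirely
    lo, hi = coeffs.min(), coeffs.max()
    step = (hi - lo) / (2 ** bits - 1) if hi > lo else 1.0
    codes = np.round((coeffs - lo) / step).astype(np.uint8)
    return codes, lo, step

def compress(delta, p=8):
    """Patchify, allocate bit-widths, then quantize each patch in the DCT domain."""
    C = dct_matrix(p)
    patches = to_patches(delta, p)
    bits = allocate_bits(patches)
    packed = [quantize(C @ x @ C.T, b) for x, b in zip(patches, bits)]
    return packed, bits

def reconstruct(packed, shape, p=8):
    """Dequantize (linear mapping), apply the inverse DCT, and reassemble the matrix."""
    C = dct_matrix(p)
    out = np.zeros((len(packed), p, p))
    for i, q in enumerate(packed):
        if q is not None:
            codes, lo, step = q
            out[i] = C.T @ (codes * step + lo) @ C
    return from_patches(out, shape, p)

# Usage on a synthetic "delta" matrix (random values, not real model weights).
rng = np.random.default_rng(0)
delta = rng.normal(scale=1e-3, size=(32, 32))
packed, bits = compress(delta, p=8)
recon = reconstruct(packed, delta.shape, p=8)
print("average bits per parameter:", bits.mean())
print("relative error:", np.linalg.norm(recon - delta) / np.linalg.norm(delta))
```

Because the DCT basis here is orthonormal, a patch's $L_2$ norm is the same before and after the transform, so ranking patches in the parameter domain is equivalent to ranking them in the DCT domain. The per-patch `lo` and `step` values are the kind of metadata that adds a small storage overhead on top of the quantized codes.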

Paper Structure

This paper contains 26 sections, 10 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison of compression methods: (a) BitDelta liu2024bitdelta, which needs to finetune the scale factors; (b) Delta-CoMe ping2024deltacome, which relies on GPTQ frantar2022optq and therefore requires calibration data; and (c) our Delta-DCT, which applies mixed-precision patch-wise quantization to the DCT-transformed patches via importance-based bit-width allocation and requires no data or training.
  • Figure 2: (a) Overview of the proposed data-free delta compression framework. We first divide the delta parameter into patches according to the patch size $p$. We then compute the $L_2$ norm of each patch as its importance score, as shown in (b). Based on the importance scores, we allocate different bit-widths to different patches, as shown in (c). Meanwhile, we apply the Discrete Cosine Transform (DCT) to each patch and perform mixed-precision quantization to obtain the compressed data, as shown in (d). At inference, we first apply a linear mapping to the compressed delta parameters, and then perform the inverse DCT (IDCT) and rescaling for reconstruction, as shown in (e).
  • Figure 3: Visualization of delta parameter distribution of a layer in the (a) Finetuned model, (b) BitDelta liu2024bitdelta compressed model, (c) Delta-CoMe compressed model, and our Delta-DCT compressed model under the patch size setting of (d) $p=16$ and (e) $p=8$. Existing methods usually cause obvious delta parameter distribution offsets while the delta parameter distribution of our Delta-DCT is almost the same as that of the finetuned model.
  • Figure 4: Grad-CAM visualization results of ViT-B/32 models compressed by different delta compression methods. Existing methods usually cause shifts in the focused areas while our Delta-DCT with different patch sizes consistently focuses on the area closest to the finetuned model.
  • Figure 5: Case Study of a coding task for different delta compression methods. Only our Delta-DCT outputs the correct answer.
  • ...and 2 more figures