ImPart: Importance-Aware Delta-Sparsification for Improved Model Compression and Merging in LLMs

Yan Yang, Yixia Li, Hongru Wang, Xuetao Wei, Jianqiao Yu, Yun Chen, Guanhua Chen

TL;DR

ImPart introduces an importance-aware delta-sparsification framework that uses singular value decomposition (SVD) to allocate sparsity across singular vectors according to their importance, preserving task-specific knowledge at high compression ratios. The method includes a principled sparsity allocation strategy, a theoretical justification for unbiased reconstruction, and practical integrations with delta quantization (ImPart-Qt) and model merging (TA and TIES). Empirical results across mathematics, code generation, and chat tasks show ImPart achieving state-of-the-art delta sparsification, with roughly a $2\times$ higher compression ratio than baselines at the same performance level, and improved results when combined with quantization and merging. The work shows strong potential for deploying and merging many fine-tuned LLMs in resource-constrained environments, while noting limitations and directions for future work around layer-wise sparsification and validation-set dependence.

Abstract

With the proliferation of task-specific large language models, delta compression has emerged as a method to mitigate the resource challenges of deploying numerous such models by effectively compressing the delta model parameters. Previous delta-sparsification methods either remove parameters randomly or truncate singular vectors directly after singular value decomposition (SVD). However, these methods either disregard parameter importance entirely or evaluate it with too coarse a granularity. In this work, we introduce ImPart, a novel importance-aware delta-sparsification approach. Leveraging SVD, it dynamically adjusts the sparsity ratios of different singular vectors based on their importance, effectively retaining crucial task-specific knowledge even at high sparsity ratios. Experiments show that ImPart achieves state-of-the-art delta-sparsification performance, demonstrating a $2\times$ higher compression ratio than baselines at the same performance level. When integrated with existing methods, ImPart sets a new state of the art on delta quantization and model merging.

Paper Structure

This paper contains 57 sections, 9 equations, 4 figures, 8 tables, 3 algorithms.

Figures (4)

  • Figure 1: Comparative evaluation of ImPart against state-of-the-art sparsification methods across mathematical reasoning, code generation, and chat tasks. ImPart consistently outperforms baselines across various tasks while maintaining high sparsity ratios (more detailed discussions are in Section \ref{sec:diff_sparse_ratio}).
  • Figure 2: Overview of ImPart. (a) Delta parameter computation by subtracting the base model from the fine-tuned model. (b) Comparison of delta-parameter sparsification methods: DARE randomly drops delta parameters, LowRank sparsifies with a low-rank approximation, and ImPart adaptively sparsifies singular vectors. (c) Mixed-precision quantization applied to the sparse singular vectors to achieve higher compression ratios. (d) Model merging by combining sparsified delta parameters to build a unified multi-task model.
  • Figure 3: Importance-aware delta-sparsification adaptively sets sparsity ratios based on singular values, ensuring critical information retention. ImPart first pre-prunes small singular components and then allocates sparsity budget based on regularized singular values.
  • Figure 4: Comparative evaluation of ImPart against state-of-the-art quantization methods across mathematical reasoning, code generation, and chat tasks (more detailed discussions are in Section \ref{sec:exp_quant}).
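
The captions for Figures 2 and 3 describe three ways to sparsify delta parameters. A minimal NumPy sketch of that contrast may help make the idea concrete. It is not the authors' implementation. The function names, the `mean_keep` and `min_keep` parameters, and above all the allocation rule in `importance_aware_sparsify` (keep ratio proportional to the singular value, then clipped) are assumptions made for illustration; the paper's actual allocation based on regularized singular values is not reproduced here.

```python
import numpy as np


def delta_params(w_finetuned, w_base):
    """Figure 2(a): delta = fine-tuned weights minus base weights."""
    return w_finetuned - w_base


def dare_sparsify(delta, p, rng):
    """DARE-style baseline: drop each entry with probability p and
    rescale the survivors by 1 / (1 - p) so the expectation is unchanged."""
    mask = rng.random(delta.shape) >= p
    return delta * mask / (1.0 - p)


def lowrank_sparsify(delta, rank):
    """LowRank baseline: keep only the top-`rank` singular components."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


def importance_aware_sparsify(delta, mean_keep, pre_prune, rng, min_keep=0.05):
    """Illustrative importance-aware variant (allocation rule is assumed).

    1. Pre-prune the smallest `pre_prune` fraction of singular components.
    2. Give each remaining singular vector its own keep ratio, larger for
       larger singular values (hypothetical rule: proportional, then clipped,
       so the average keep ratio only approximately equals `mean_keep`).
    3. Randomly mask entries of each singular vector and rescale by
       1 / keep ratio, which keeps each rank-one term unbiased in expectation.
    """
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    k = max(1, int(round(len(s) * (1.0 - pre_prune))))
    u, s, vt = u[:, :k], s[:k], vt[:k]

    # Assumed allocation: keep ratio grows with the singular value.
    keep = np.clip(mean_keep * s / s.mean(), min_keep, 1.0)

    # Independent masks for left and right singular vectors.
    u_mask = rng.random(u.shape) < keep            # keep[i] applies to column i
    v_mask = rng.random(vt.shape) < keep[:, None]  # keep[i] applies to row i
    u_sparse = u * u_mask / keep
    vt_sparse = vt * v_mask / keep[:, None]

    return (u_sparse * s) @ vt_sparse, keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_base = rng.normal(size=(64, 64))
    w_ft = w_base + 0.01 * rng.normal(size=(64, 64))  # toy "fine-tuned" weights
    delta = delta_params(w_ft, w_base)

    approx, keep = importance_aware_sparsify(delta, mean_keep=0.3,
                                             pre_prune=0.5, rng=rng)
    print("keep ratio, largest vs. smallest retained component:",
          keep[0], keep[-1])
```

In this sketch the sparse factors, rather than the dense product, are what would be stored; the product is formed only to compare against the original delta. Because `np.linalg.svd` returns singular values in descending order, the assumed rule yields keep ratios that never increase from one component to the next.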