ImPart: Importance-Aware Delta-Sparsification for Improved Model Compression and Merging in LLMs
Yan Yang, Yixia Li, Hongru Wang, Xuetao Wei, Jianqiao Yu, Yun Chen, Guanhua Chen
TL;DR
ImPart introduces an importance-aware delta-sparsification framework that uses singular value decomposition (SVD) to allocate sparsity across singular vectors according to their importance, preserving task-specific knowledge at high compression. The method includes a principled sparsity allocation strategy, a theoretical justification for unbiased reconstruction, and practical integrations with delta quantization (ImPart-Qt) and model merging (Task Arithmetic (TA) and TIES). Empirical results across mathematics, code generation, and chat tasks show ImPart achieving state-of-the-art delta sparsification, with about $2\times$ higher compression at the same performance and improved results when combined with quantization and merging. The work demonstrates strong potential for deploying and merging many fine-tuned LLMs in resource-constrained environments, and it discusses limitations and directions for future work, including layer-wise sparsification and dependence on a validation set.
Abstract
With the proliferation of task-specific large language models, delta compression has emerged as a method to mitigate the resource challenges of deploying numerous such models by effectively compressing the delta model parameters, that is, the difference between the fine-tuned and base model weights. Previous delta-sparsification methods either remove parameters randomly or truncate singular vectors directly after singular value decomposition (SVD). However, these methods either disregard parameter importance entirely or evaluate it with too coarse a granularity. In this work, we introduce ImPart, a novel importance-aware delta-sparsification approach. Leveraging SVD, it dynamically adjusts the sparsity ratios of different singular vectors based on their importance, effectively retaining crucial task-specific knowledge even at high sparsity ratios. Experiments show that ImPart achieves state-of-the-art delta-sparsification performance, demonstrating a $2\times$ higher compression ratio than baselines at the same performance level. When integrated with existing methods, ImPart sets a new state of the art on delta quantization and model merging.
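The core idea can be illustrated with a small NumPy sketch: take the SVD of the delta, give each singular vector its own keep probability based on its singular value, randomly drop entries of the singular vectors, and rescale the survivors so the reconstruction is unbiased in expectation. The function name `sparsify_delta`, the `keep_ratio` parameter, and the proportional allocation rule are assumptions made for this example; the paper's actual allocation strategy may differ.

```python
import numpy as np

def sparsify_delta(w_base, w_ft, keep_ratio=0.25, seed=0):
    """Illustrative sketch (not the paper's exact algorithm).

    Returns a stochastic reconstruction of the delta and the per-singular-vector
    keep probabilities used to produce it.
    """
    rng = np.random.default_rng(seed)
    delta = w_ft - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)

    mean_s = s.mean()
    if mean_s == 0.0:
        # Nothing to compress: the fine-tuned and base weights are identical.
        return np.zeros_like(delta), np.zeros_like(s)

    # ASSUMED allocation rule: keep probability proportional to the singular
    # value, clipped to [0, 1]. Larger singular values get lower sparsity.
    p = np.clip(keep_ratio * s / mean_s, 0.0, 1.0)

    # Rescale kept entries by 1/p so that E[masked vector] equals the original.
    # Vectors with p == 0 have a zero singular value here and are dropped.
    scale = np.divide(1.0, p, out=np.zeros_like(p), where=p > 0)

    # Independent Bernoulli masks: column k of U and row k of V^T use p[k].
    mask_u = rng.random(u.shape) < p
    mask_v = rng.random(vt.shape) < p[:, None]
    u_sp = u * mask_u * scale
    vt_sp = vt * mask_v * scale[:, None]

    # Dense reconstruction, only for inspection in this sketch.
    delta_hat = (u_sp * s) @ vt_sp
    return delta_hat, p

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w_base = rng.normal(size=(8, 6))
    w_ft = w_base + 0.01 * rng.normal(size=(8, 6))
    delta_hat, p = sparsify_delta(w_base, w_ft, keep_ratio=0.5)
    print("keep probabilities:", np.round(p, 3))
    print("reconstruction error:", np.linalg.norm(delta_hat - (w_ft - w_base)))
```

Because the two masks are independent and each masked factor equals the original in expectation, the expected value of `delta_hat` is the original delta. A practical implementation would store the masked factors in a sparse format rather than reconstructing a dense matrix.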
