DARE the Extreme: Revisiting Delta-Parameter Pruning For Fine-Tuned Models
Wenlong Deng, Yize Zhao, Vala Vakilian, Minghui Chen, Xiaoxiao Li, Christos Thrampoulidis
TL;DR
This work addresses the storage redundancy and latency of serving multiple fine-tuned models by improving delta-parameter pruning (DPP), specifically by extending the DARE method. It introduces DAREx, which comprises DAREx-q (rescaling by $1/q$) and DAREx-L2 (DARE combined with AdamR-$L_2$ delta regularization), to push pruning toward extreme rates while maintaining performance. Per-layer rescaling and unlabeled-data proxies further improve robustness, and AdamR-$L_2$ reduces the mean and variance of the delta parameters during training, before pruning is applied. The study also revisits importance-based DPP (MP, WANDA), showing that it can outperform random-based DPP when delta magnitudes are large, and proposes a practical pipeline for selecting the appropriate method under real-world constraints, including compatibility with LoRA and structural DPP. Empirically, DAREx-q and AdamR-$L_2$ deliver substantial gains across encoder and decoder models, enabling up to 99% delta-parameter pruning with modest performance loss, which has implications for model serving, federated learning efficiency, and storage. Overall, the paper offers a unified framework that combines post-hoc pruning with in-training regularization to extend the practical reach of DPP in large-language-model deployments.
Abstract
Storing open-source fine-tuned models separately introduces redundancy and increases response times in applications utilizing multiple models. Delta-parameter pruning (DPP), particularly the random drop and rescale (DARE) method proposed by Yu et al., addresses this by pruning the majority of delta parameters (the differences between fine-tuned and pre-trained model weights) while typically maintaining minimal performance loss. However, DARE fails when either the pruning rate or the magnitude of the delta parameters is large. We highlight two key reasons for this failure: (1) an excessively large rescaling factor as pruning rates increase, and (2) high mean and variance in the delta parameters. To push DARE's limits, we introduce DAREx (DARE the eXtreme), which features two algorithmic improvements: (1) DAREx-q, a rescaling factor modification that significantly boosts performance at high pruning rates (e.g., >30% on COLA and SST2 for encoder models, with even greater gains in decoder models), and (2) DAREx-L2, which combines DARE with AdamR, an in-training method that applies appropriate delta regularization before DPP. We also demonstrate that DAREx-q can be seamlessly combined with vanilla parameter-efficient fine-tuning techniques like LoRA and can facilitate structural DPP. Additionally, we revisit the application of importance-based pruning techniques within DPP, demonstrating that they outperform random-based methods when delta parameters are large. Through this comprehensive study, we develop a pipeline for selecting the most appropriate DPP method under various practical scenarios.
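
The drop-and-rescale step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `dare_prune`, the toy weight matrices, and the value `q=0.05` are hypothetical, and the sketch assumes that DARE rescales surviving delta parameters by $1/(1-p)$ for pruning rate $p$, while DAREx-q substitutes a factor $1/q$ (assumed here to satisfy $q > 1-p$, so that the factor is smaller). How $q$ is actually selected is not covered.

```python
import numpy as np

def dare_prune(delta, p, q=None, rng=None):
    """Randomly drop a fraction p of delta parameters and rescale the survivors.

    q=None uses the rescaling factor 1/(1-p) (assumed DARE default).
    Passing q uses 1/q instead (a sketch of the DAREx-q idea; choosing q
    is outside the scope of this example).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(delta.shape) >= p          # each entry survives with prob. 1-p
    scale = 1.0 / (1.0 - p) if q is None else 1.0 / q
    return np.where(keep, delta * scale, 0.0)

# Toy "pre-trained" and "fine-tuned" weights (hypothetical data).
rng = np.random.default_rng(0)
w_pre = rng.normal(size=(64, 64))
w_ft = w_pre + 0.01 * rng.normal(size=(64, 64))
delta = w_ft - w_pre                              # delta parameters

p = 0.99
delta_dare = dare_prune(delta, p, rng=rng)            # survivors scaled by 100
delta_q = dare_prune(delta, p, q=0.05, rng=rng)       # survivors scaled by 20

# The pruned model is the pre-trained weights plus the sparse, rescaled delta.
w_dare = w_pre + delta_dare
w_q = w_pre + delta_q
print("fraction of deltas kept:", np.count_nonzero(delta_dare) / delta.size)
```

At $p = 0.99$ the default factor is 100, which shows why the rescaling factor becomes excessively large at extreme pruning rates; the $1/q$ variant keeps the same sparsity pattern statistics but amplifies the surviving entries less.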
