DARE the Extreme: Revisiting Delta-Parameter Pruning For Fine-Tuned Models

Wenlong Deng, Yize Zhao, Vala Vakilian, Minghui Chen, Xiaoxiao Li, Christos Thrampoulidis

TL;DR

This work addresses the redundancy and latency issues of storing multiple fine-tuned models by improving delta-parameter pruning (DPP), specifically extending the DARE method. It introduces DAREx, comprising DAREx-q (rescaling with $1/q$) and DAREx-L2 (AdamR-$L_2$ delta-regularization), to push effective pruning toward extreme rates while maintaining performance; per-layer rescaling and unlabeled-data proxies further enhance robustness, and AdamR-L2 enables in-training reduction of delta statistics prior to pruning. The study also revisits importance-based DPP (MP, WANDA), showing they can outperform random-based DPP under large delta magnitudes and proposing a practical pipeline to select the appropriate method under real-world constraints, including compatibility with LoRA and structural DPP. Empirically, DAREx-q and AdamR-L2 deliver substantial gains across encoder and decoder models, enabling up to 99% delta-parameter pruning with modest performance loss, which has meaningful implications for model serving, federated learning efficiency, and storage. Overall, the paper provides a unified framework that balances post-hoc pruning and in-training regularization to maximize the practicality and reach of DPP in real-world large-language-model deployments.

Abstract

Storing open-source fine-tuned models separately introduces redundancy and increases response times in applications utilizing multiple models. Delta-parameter pruning (DPP), particularly the random drop and rescale (DARE) method proposed by Yu et al., addresses this by pruning the majority of delta parameters--the differences between fine-tuned and pre-trained model weights--while typically maintaining minimal performance loss. However, DARE fails when either the pruning rate or the magnitude of the delta parameters is large. We highlight two key reasons for this failure: (1) an excessively large rescaling factor as pruning rates increase, and (2) high mean and variance in the delta parameters. To push DARE's limits, we introduce DAREx (DARE the eXtreme), which features two algorithmic improvements: (1) DAREx-q, a rescaling factor modification that significantly boosts performance at high pruning rates (e.g., >30 % on COLA and SST2 for encoder models, with even greater gains in decoder models), and (2) DAREx-L2, which combines DARE with AdamR, an in-training method that applies appropriate delta regularization before DPP. We also demonstrate that DAREx-q can be seamlessly combined with vanilla parameter-efficient fine-tuning techniques like LoRA and can facilitate structural DPP. Additionally, we revisit the application of importance-based pruning techniques within DPP, demonstrating that they outperform random-based methods when delta parameters are large. Through this comprehensive study, we develop a pipeline for selecting the most appropriate DPP method under various practical scenarios.
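The drop-and-rescale step described in the abstract can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the authors' implementation: the function names `dare_prune` and `apply_delta` are hypothetical, and the sketch only exposes the rescaling factor $1/q$ as a free parameter without showing how DAREx-q chooses it. Passing `q=None` falls back to the standard DARE factor $1/(1-p)$.

```python
import numpy as np

def dare_prune(delta, p, q=None, rng=None):
    """Randomly drop each delta entry with probability p, rescale survivors by 1/q.

    q=None uses the standard DARE factor 1/(1-p). Any other q is a
    DAREx-q style modification (how q is selected is NOT shown here).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    rng = np.random.default_rng() if rng is None else rng
    q = (1.0 - p) if q is None else q
    keep = rng.random(delta.shape) >= p  # each entry is kept with probability 1 - p
    return np.where(keep, delta / q, 0.0)

def apply_delta(w_pre, w_ft, p, q=None, rng=None):
    """Rebuild approximate fine-tuned weights: pre-trained weights + pruned delta."""
    delta = w_ft - w_pre  # delta parameters: fine-tuned minus pre-trained
    return w_pre + dare_prune(delta, p, q, rng)

# Toy usage with synthetic weights (hypothetical shapes and scales).
rng = np.random.default_rng(0)
w_pre = rng.normal(size=(64, 64))
w_ft = w_pre + 0.01 * rng.normal(size=(64, 64))
w_hat = apply_delta(w_pre, w_ft, p=0.9, rng=rng)
print("fraction of delta entries kept:", np.mean(w_hat != w_pre))
```

With the standard factor, each surviving entry is multiplied by $1/(1-p)$, which is 100 at $p=0.99$; this is the "excessively large rescaling factor" the abstract identifies as the first failure mode.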

Paper Structure

This paper contains 43 sections, 2 theorems, 24 equations, 12 figures, 16 tables, 3 algorithms.

Key Result

Theorem 3.1

Denote $h_i^{\rm{diff}}$ as the $i$-th component of $\bm{h}^{\rm{diff}}$ in Eq. eq:hdiff dare. For $i \in [m], j \in [n]$, let $c_{ij} = \Delta W_{ij} x_j$ represent the change in influence of the $j$-th feature on the $i$-th output neuron after fine-tuning. Define [...] (we call 'mean'/'variance' here the [...]) [...], where $\Psi(p)= (1-2p)/\log((1-p)/p)$ if $p\leq 1/2$, otherwise $\Psi(p)=\sqrt{2p(1-p)}$.
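The function $\Psi(p)$ is the one part of the theorem statement that survives in full, so it can be evaluated directly. The sketch below only computes $\Psi(p)$ as written above and prints it next to the standard DARE rescaling factor $1/(1-p)$; it does not reproduce the bound itself. The function name `psi` is hypothetical, and the value returned at exactly $p=1/2$ (where the first branch is $0/0$) uses the limit $1/2$ of that branch, which is an assumption of this sketch rather than something stated in the text.

```python
import math

def psi(p):
    """Psi(p) as stated in Theorem 3.1, for a drop rate p in (0, 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be in (0, 1)")
    if p > 0.5:
        return math.sqrt(2.0 * p * (1.0 - p))
    if p == 0.5:
        # The first branch is 0/0 at p = 1/2; returning its limit (1/2)
        # is an assumption made for this sketch.
        return 0.5
    return (1.0 - 2.0 * p) / math.log((1.0 - p) / p)

# Compare Psi(p) with the standard DARE rescaling factor at several pruning rates.
for p in (0.25, 0.9, 0.99, 0.999):
    print(f"p={p}: psi={psi(p):.4f}, 1/(1-p)={1.0 / (1.0 - p):.1f}")
```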

Figures (12)

  • Figure 1: Our DAREx-q improves on DARE's performance by tuning the rescaling factor $1/q$. Experiments with BERT at a pruning rate of $p=0.99$. Right: The optimal rescaling factor (asterisk), which maximizes test performance, differs from the standard $1/(1-p)$ across all four datasets and yields up to $>10$-fold gains. Left: The rescaling factor that minimizes the last-layer output change, averaged over last-layer neurons, serves as an excellent proxy for the optimal factor maximizing performance and can be determined by inference on a single training batch.
  • Figure 2: Across pruning rates $p$, DAREx-q performs at least as well as vanilla DARE and significantly outperforms it at higher pruning rates.
  • Figure 3: Applying DARE on decoder models finetuned with AdamR-$L_2$ (DAREx-$L_2$) at varying regularization strengths demonstrates significant performance improvements for $p\geq 0.9$.
  • Figure 4: Flowchart for selecting appropriate DPP methods based on different scenarios.
  • Figure 5: Controlled experiments of DPP performance on a two-layer neural net. (a) Influence of variance and mean statistics on DARE. (b) Influence of normalization layer. (c) $L_1$ regularization for importance-based pruning. (d) Methods with best-fitting regularization.
  • ...and 7 more figures
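
The Figure 1 caption says that the rescaling factor minimizing the last-layer output change, averaged over neurons, is a good proxy for the best-performing factor and can be found from a single batch. A rough sketch of that selection idea follows, under strong simplifying assumptions: a single toy linear layer stands in for the whole model, the candidate factors come from a plain grid, one random mask is shared across candidates, and the name `select_rescale` is hypothetical. It is not the paper's procedure.

```python
import numpy as np

def select_rescale(w_pre, delta, x, p, candidates, rng):
    """Grid-search the rescaling factor by mean absolute output change (toy linear layer)."""
    keep = rng.random(delta.shape) >= p      # one fixed random mask (assumption)
    ref = x @ (w_pre + delta).T              # outputs of the unpruned fine-tuned layer
    best_scale, best_change = None, np.inf
    for scale in candidates:                 # scale plays the role of 1/q
        pruned = np.where(keep, delta * scale, 0.0)
        out = x @ (w_pre + pruned).T
        change = np.abs(out - ref).mean()    # averaged over batch and output neurons
        if change < best_change:
            best_scale, best_change = scale, change
    return best_scale, best_change

# Toy usage: synthetic weights and one synthetic "batch" (all values hypothetical).
rng = np.random.default_rng(0)
w_pre = rng.normal(size=(8, 256))
delta = 0.05 + 0.02 * rng.normal(size=(8, 256))  # delta with non-zero mean (toy choice)
x = rng.normal(size=(32, 256))
p = 0.99
candidates = np.linspace(1.0, 1.0 / (1.0 - p), 50)  # from no rescaling up to 1/(1-p)
scale, change = select_rescale(w_pre, delta, x, p, candidates, rng)
print("selected rescaling factor:", scale, "mean |output change|:", change)
```

Because the standard factor $1/(1-p)$ is itself one of the candidates, the selected factor can never give a larger output change than standard DARE on that batch and mask.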

Theorems & Definitions (2)

  • Theorem 3.1
  • Theorem E.1: kearns