DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling

Pala Tej Deep; Rishabh Bhardwaj; Soujanya Poria

TL;DR

The paper tackles the interference that arises when merging homologous, task-tuned models by introducing Della, a three-step merging method (Drop, Elect, Fuse) that operates on delta parameters and uses MagPrune, a magnitude-based stochastic pruning strategy. The Drop step prunes delta parameters with dropout probabilities inversely related to their magnitudes and rescales the surviving deltas; the Elect step selects deltas with consistent signs; and the Fuse step blends them using a tunable scaling factor $\lambda$. Theoretical analysis supports rescaling as a way to preserve the original embeddings. Experiments across LM, Math, and Code domains show that Della, especially with MagPrune and row-wise ranking, typically surpasses the baselines (Dare, Ties, TA), and the source code is released. Key findings include the critical role of rescaling, the effectiveness of MagPrune at high pruning ratios, and the advantage of a constant $\lambda$ over adaptive scaling. The result is an efficient, interference-robust way to combine domain-specific experts into a single multitask model.
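The Elect and Fuse steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `elect_and_fuse`, the use of the sign of the summed deltas as the "consistent" direction, and the averaging of the elected deltas are all assumptions made for illustration, and the deltas are assumed to have already passed through the Drop step.

```python
def elect_and_fuse(base, deltas, lam=1.0):
    """Merge per-expert delta vectors into base weights (illustrative only).

    base   : list of floats, the shared pre-trained weights
    deltas : list of delta vectors (one per expert), same length as base
    lam    : the scaling factor lambda applied to the fused delta
    """
    merged = []
    for j, b in enumerate(base):
        column = [d[j] for d in deltas]
        total = sum(column)
        # Assumption: the elected direction is the sign of the summed deltas.
        if total > 0:
            elected = [v for v in column if v > 0]
        elif total < 0:
            elected = [v for v in column if v < 0]
        else:
            elected = []
        # Assumption: elected deltas are averaged before scaling by lambda.
        fused = sum(elected) / len(elected) if elected else 0.0
        merged.append(b + lam * fused)
    return merged


if __name__ == "__main__":
    base = [1.0, 1.0, 1.0]
    deltas = [[0.2, -0.4, 0.0], [0.4, 0.1, 0.0]]
    print(elect_and_fuse(base, deltas, lam=0.5))
```

In the second position the two experts disagree in sign, so only the delta matching the dominant direction contributes; this is the kind of interference the Elect step is meant to reduce.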

Abstract

With the proliferation of domain-specific models, model merging has emerged as a set of techniques that combine the capabilities of multiple models into one that can multitask without the cost of additional training. In this paper, we propose a new model merging technique, Drop and rEscaLe via sampLing with mAgnitude (DELLA-Merging), that employs a novel pruning technique, MAGPRUNE, which shows significant advantages over DARE and TIES. MAGPRUNE first ranks the parameters in order of their magnitude and assigns higher dropout probabilities (p) to parameters with lower ranks corresponding to lower magnitudes. To approximate the original embeddings, MAGPRUNE employs a rescaling operation on the parameters that survive the random dropping by 1/(1 - p). On three different expert models considered for merging (LM, Math, Code) and corresponding benchmark datasets (AlpacaEval, GSM8K, MBPP), DELLA shows an average improvement of 2.4 points over baseline methods employing delta parameter pruning (an improvement of 3.6 points over TIES, 1.2 points over DARE), and 11.1 points over the no-pruning baseline (TA). We release the source code at: https://github.com/declare-lab/della.
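The pruning rule in the abstract (rank by magnitude, give smaller magnitudes a higher drop probability, rescale survivors by 1/(1 - p)) can be sketched in a few lines of plain Python. This is an illustrative reading, not the released code: the helper names, the `eps` parameter, and the linear spacing of probabilities in the window [p - eps/2, p + eps/2] are assumptions.

```python
import random


def drop_probabilities(delta, p=0.5, eps=0.2):
    """Assign each entry a drop probability; smaller magnitude -> higher probability."""
    n = len(delta)
    p_min, p_max = p - eps / 2, p + eps / 2
    if not (0.0 <= p_min and p_max < 1.0):
        raise ValueError("need 0 <= p - eps/2 and p + eps/2 < 1")
    if n == 1:
        return [p]
    # order[r] is the index of the entry with the r-th smallest magnitude.
    order = sorted(range(n), key=lambda i: abs(delta[i]))
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        # Assumption: probabilities fall linearly from p_max to p_min with rank.
        probs[idx] = p_max - (p_max - p_min) * rank / (n - 1)
    return probs


def magprune(delta, p=0.5, eps=0.2, rng=None):
    """Randomly drop entries, then rescale each survivor by 1 / (1 - its drop probability)."""
    rng = rng or random.Random()
    probs = drop_probabilities(delta, p, eps)
    return [d / (1.0 - q) if rng.random() >= q else 0.0
            for d, q in zip(delta, probs)]


if __name__ == "__main__":
    row = [0.1, -2.0, 0.5, 3.0]
    print(drop_probabilities(row, p=0.5, eps=0.4))
    print(magprune(row, p=0.5, eps=0.4, rng=random.Random(0)))
```

Because each surviving entry is divided by its own keep probability, the pruned vector equals the original in expectation, which is the sense in which rescaling approximates the original embeddings.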

Paper Structure (33 sections, 9 equations, 6 figures, 7 tables, 1 algorithm)

Introduction
Methodology
Della
Step-1: Drop.
Step-2: Elect.
Step-3: Fuse.
MagPrune: Stochastic Magnitude-based Pruning
Dare and Ties as configuration of Della.
Theoretical Analysis
Experimental Setup
Expert Models.
Baselines.
Evaluation Metrics.
Experiments (w/o Elect).
Experiments (only Drop).
...and 18 more sections

Figures (6)

Figure 1: Methodology: Three Steps involved in Della. First step performs magnitude-based sampling of delta parameters (MagPrune), second step elects the parameters that will undergo merging operation, and the final step (Fuse) performs merging.
Figure 2: Mapping weight magnitudes to inversely proportional drop probabilities.
Figure 3: Performance vs Drop Rate comparison of Della (magnitude-based random drop) against baselines Dare (random drop) and Ties (magnitude-based deterministic drop).
Figure 4: Performance vs. $\lambda$ for the Math+Code merge combination.
Figure 5: Adaptive scaling vs. constant $\lambda$ scaling in Della.
...and 1 more figures
