Localize-and-Stitch: Efficient Model Merging via Sparse Task Arithmetic

Yifei He, Yuzheng Hu, Yong Lin, Tong Zhang, Han Zhao

TL;DR

The paper tackles the task interference that arises when multiple finetuned models are merged globally. It introduces Localize-and-Stitch, which localizes essential finetuning skills into tiny sparse regions (about $1\%$ of parameters) and stitches them back onto the pretrained base. Localization uses a learnable mask $\gamma_i=\sigma(S_i)$ optimized with an $L_1$-regularized objective that preserves task performance, yielding small, interpretable regions with few cross-task conflicts. Stitching aggregates masked task vectors so that $\theta_{\text{merged}}=\theta_{\text{pre}}+\sum_i (\gamma_i'\odot\tau_i)$, where $\tau_i=\theta_{\text{ft}}^{(i)}-\theta_{\text{pre}}$ is the task vector of model $i$ and $\gamma_i'$ is the mask $\gamma_i$ rescaled so that overlapping positions are averaged. Stitching requires no hyperparameter tuning, which enables data-efficient merging and straightforward continual learning. Empirically, the method achieves state-of-the-art or competitive multi-task performance across NLP, vision, and decoder-based language models, even in dataless settings, and enables substantial model compression (down to ~1% of original storage) while preserving pretrained knowledge and supporting scalable skill composition with minimal storage and compute.
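
As a sketch of what the $L_1$-regularized localization objective could look like, one plausible form is written out here. The symbols $\ell_i$ (the loss on task $i$) and $\lambda$ (the regularization strength) are notational assumptions that this summary does not spell out, and $\tau_i$ denotes the task vector of model $i$:

$$
S_i \;=\; \arg\min_{S}\; \ell_i\bigl(\theta_{\text{pre}} + \sigma(S)\odot\tau_i\bigr) \;+\; \lambda\,\lVert\sigma(S)\rVert_1
$$

Under this reading, the first term keeps the masked model close to finetuned performance on task $i$, while the second term pushes most entries of $\gamma_i=\sigma(S_i)$ toward zero so that only a small region stays active.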

Abstract

Model merging offers an effective strategy to combine the strengths of multiple finetuned models into a unified model that preserves the specialized capabilities of each. Existing methods merge models in a global manner, performing arithmetic operations across all model parameters. However, such global merging often leads to task interference, degrading the performance of the merged model. In this work, we introduce Localize-and-Stitch, a novel approach that merges models in a localized way. Our algorithm works in two steps: i) Localization: identify tiny ($1\%$ of the total parameters) localized regions in the finetuned models containing essential skills for the downstream tasks, and ii) Stitching: reintegrate only these essential regions back into the pretrained model for task synergy. We demonstrate that our approach effectively locates sparse regions responsible for finetuned performance, and the localized regions could be treated as compact and interpretable representations of the finetuned models (tasks). Empirically, we evaluate our method on various vision and language benchmarks, showing that it outperforms existing model merging methods under different data availability scenarios. Beyond strong empirical performance, our algorithm also facilitates model compression and preserves pretrained knowledge, enabling flexible and continual skill composition from multiple finetuned models with minimal storage and computational overhead. Our code is available at https://github.com/uiuctml/Localize-and-Stitch.
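
The stitching step can be illustrated with a minimal sketch, assuming parameters are flattened into plain Python lists and masks are binary. The function name `stitch` and the toy values are hypothetical and are not taken from the authors' repository; a real implementation would operate on model tensors.

```python
def stitch(theta_pre, finetuned, masks):
    """Stitch masked task vectors onto the pretrained parameters.

    theta_pre: list of floats (pretrained parameters, flattened)
    finetuned: list of parameter lists, one per finetuned model
    masks:     list of binary masks (0/1), one per finetuned model
    """
    n_params = len(theta_pre)
    # How many tasks selected each parameter position.
    counts = [sum(mask[k] for mask in masks) for k in range(n_params)]
    merged = list(theta_pre)
    for theta_ft, mask in zip(finetuned, masks):
        for k in range(n_params):
            if mask[k]:
                # Task vector entry: finetuned minus pretrained.
                tau_k = theta_ft[k] - theta_pre[k]
                # Dividing by the count averages overlapping positions.
                merged[k] += tau_k / counts[k]
    return merged


# Toy example: two tasks whose masks overlap at position 1.
theta_pre = [0.0, 0.0, 0.0, 0.0]
theta_a = [1.0, 2.0, 0.0, 0.0]
theta_b = [0.0, 4.0, 3.0, 0.0]
mask_a = [1, 1, 0, 0]
mask_b = [0, 1, 1, 0]

merged = stitch(theta_pre, [theta_a, theta_b], [mask_a, mask_b])
print(merged)  # [1.0, 3.0, 3.0, 0.0]
```

Position 1 is selected by both tasks, so its two updates (2.0 and 4.0) are averaged to 3.0. Position 3 is selected by neither task and keeps its pretrained value.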

Paper Structure

This paper contains 33 sections, 9 equations, 14 figures, 17 tables, 1 algorithm.

Figures (14)

  • Figure 1: Localize-and-Stitch: Given $n$ models $\{\theta_{\text{ft}}^{(i)}\}_{i=1}^n$ finetuned from $\theta_{\text{pre}}$, we first localize regions containing skills acquired during finetuning through per-model binary masks $\{\gamma_i\}_{i=1}^n$, then stitch the localized regions $\{\gamma_i\odot\theta_{\text{ft}}^{(i)}\}_{i=1}^n$ onto the pretrained model, where $\odot$ is the element-wise product. Empty nodes after the localization step mean that the mask is not activated at that position. Since the localized regions are tiny $(\sim 1\%)$, we reduce potential task conflicts and make minimal changes to the pretrained model.
  • Figure 2: Our method most effectively locates sparse regions essential for finetuned performance. Sparsity level indicates the proportion of total parameters localized. By localizing only $1\%$ of parameters (at sparsity level 0.01), our approach recovers $99\%$ of the finetuned performance (at sparsity level 1).
  • Figure 3: Merged models with more parameter overlap manifest more task conflicts, resulting in performance decrease. The overlapped proportion is over the model's total parameter count. The simple averaging baseline is over all model parameters.
  • Figure 4: Our localized regions (each task with $1\%$ of total parameters) have little pairwise overlap, with the majority of Jaccard similarity below $5\%$. The sentiment classification tasks (SST-2, CR, MR, MPQA) have relatively large overlap because they share similar skills in the overlapping regions, and we verify this by showing that they have high cosine similarity of masked task vectors.
  • Figure 5: The localized regions are predominantly found in the LayerNorm parameters, while different tasks are associated with different layers. The percentages represent the proportion of localized parameters in each component.
  • ...and 9 more figures
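
The pairwise overlap measure named in the Figure 4 caption can be sketched as follows. This is a minimal illustration of Jaccard similarity between two binary masks, assuming flattened 0/1 lists; the function name `jaccard` and the toy masks are hypothetical and do not reproduce the paper's data.

```python
def jaccard(mask_a, mask_b):
    """Jaccard similarity between two binary masks of equal length."""
    # Index sets of the positions each mask activates.
    active_a = {k for k, on in enumerate(mask_a) if on}
    active_b = {k for k, on in enumerate(mask_b) if on}
    union = active_a | active_b
    # Two empty masks share nothing; define their similarity as 0.
    if not union:
        return 0.0
    return len(active_a & active_b) / len(union)


mask_a = [1, 1, 0, 0, 1]
mask_b = [0, 1, 1, 0, 1]
overlap = jaccard(mask_a, mask_b)
print(overlap)  # 0.5
```

Here the masks share two active positions out of four positions active in either mask, giving 0.5. Values near zero indicate that two tasks were localized to nearly disjoint regions.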