Multi-Task Model Merging via Adaptive Weight Disentanglement

Feng Xiong, Runxi Cheng, Wang Chen, Zhanqiu Zhang, Yiwen Guo, Chun Yuan, Ruifeng Xu

TL;DR

The paper addresses the challenge of efficiently merging multiple task-specific fine-tuned weights into a single multi-task model while minimizing interference. It introduces Adaptive Weight Disentanglement (AWD), which decomposes each task vector $\tau_i$ into a disentangled part and a redundant vector $\delta$, optimizing for mutual orthogonality among disentangled vectors and small $\|\,\delta\,\|$ under a weighted objective $\mathcal{L}=\mathcal{L}_{\mathcal{O}}+\alpha\mathcal{L}_{\mathcal{R}}$. The authors formalize a Task Consistency Property and argue that approximate orthogonality among task vectors yields near-interference-free merging, supported by Taylor-based analyses and empirical results across vision and language models. Empirically, AWD improves performance over state-of-the-art merging methods on ViT and RoBERTa, demonstrates robustness to task count and coefficient settings, and generalizes to unseen tasks; visualizations show a larger low-loss basin for the disentangled vectors, indicating reduced interference. Overall, AWD provides a principled, scalable approach to post-hoc model merging with strong theoretical backing and broad applicability to CV and NLP tasks.

Abstract

Model merging has recently gained attention as an economical and scalable approach to incorporate task-specific weights from various tasks into a unified multi-task model. For example, in Task Arithmetic (TA), adding the fine-tuned weights of different tasks can enhance the model's performance on those tasks, while subtracting them leads to task forgetting. Although TA is highly effective, interference among tasks still hampers the performance of the merged model. Existing methods for handling conflicts between tasks generally rely on empirical selection, resulting in suboptimal performance. In this paper, we introduce an Adaptive Weight Disentanglement method. We begin by theoretically proving that task vectors employed in model merging should be orthogonal to minimize interference among tasks. Guided by this insight, we initialize redundant vectors such that, when subtracted from the original task vectors, the resulting vectors exhibit increased orthogonality. Additionally, we impose a norm constraint on the redundant vectors to preserve the performance of the task-specific models. Experimental results demonstrate the effectiveness of our proposed technique: it successfully extracts redundant vectors, and after their subtraction, the task vectors not only retain robust performance but also achieve superior fusion outcomes. Our code is available at https://github.com/FarisXiong/AWD.git.
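
To make the objective $\mathcal{L}=\mathcal{L}_{\mathcal{O}}+\alpha\mathcal{L}_{\mathcal{R}}$ more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The summary does not state the exact form of either term, so two assumptions are made here: the orthogonality term is taken to be the mean absolute pairwise cosine similarity of the disentangled vectors $\tau_i-\delta$, and the redundancy term is taken to be the Euclidean norm of $\delta$. The function name `awd_objective` and the toy data are hypothetical.

```python
import numpy as np

def awd_objective(task_vectors, delta, alpha=1.0):
    """Illustrative L = L_O + alpha * L_R.

    Assumed forms (not taken from the paper):
      L_O = mean |cos(tau_i - delta, tau_j - delta)| over all pairs i != j
      L_R = ||delta||_2
    """
    disentangled = task_vectors - delta              # shape (num_tasks, dim)
    norms = np.linalg.norm(disentangled, axis=1, keepdims=True)
    unit = disentangled / np.clip(norms, 1e-12, None)
    cos = unit @ unit.T                              # pairwise cosine similarities
    n = task_vectors.shape[0]
    off_diagonal = cos[~np.eye(n, dtype=bool)]       # drop the i == j entries
    loss_orth = np.abs(off_diagonal).mean()
    loss_red = np.linalg.norm(delta)
    return loss_orth + alpha * loss_red, loss_orth, loss_red

# Toy data: three task vectors that share one common component.
shared = np.ones(4)
task_vectors = np.eye(3, 4) + shared                 # tau_i = e_i + shared

# Without a redundant vector the task vectors overlap strongly.
_, orth_before, red_before = awd_objective(task_vectors, np.zeros(4))

# Subtracting the shared component leaves mutually orthogonal vectors,
# at the cost of a non-zero redundancy term.
_, orth_after, red_after = awd_objective(task_vectors, shared)

print(orth_before, orth_after, red_after)
```

In this toy case, choosing $\delta$ equal to the shared component drives the orthogonality term to zero while the redundancy term grows, which is the trade-off that $\alpha$ weighs. In the method as described, $\delta$ is optimized rather than chosen by hand.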

Paper Structure

This paper contains 28 sections, 21 equations, 7 figures, 11 tables, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of our Adaptive Weight Disentanglement.
  • Figure 2: Comparative Performance of Fine-Tuned ViT-B/32 and RoBERTa Model Variants.
  • Figure 3: Impact of task numbers and coefficients on average accuracy for ViT-B/32.
  • Figure 4: Cosine similarity heatmaps for task vectors and disentangled task vectors on ViT-B/32 and ViT-L/14.
  • Figure 5: Loss landscape visualization. We visualize the loss landscape $\mathcal{L}_i(\widehat{\Theta}) + \mathcal{L}_j(\widehat{\Theta})$ by interpolating for ViT-B/32.
  • ...and 2 more figures