AdaMerging: Adaptive Model Merging for Multi-Task Learning

Enneng Yang; Zhenyi Wang; Li Shen; Shiwei Liu; Guibing Guo; Xingwei Wang; Dacheng Tao

TL;DR

The paper tackles merging pre-trained, task-specific models for multi-task learning without access to the original training data. It introduces AdaMerging, an unsupervised method that learns merging coefficients per task vector or per layer by minimizing prediction entropy on unlabeled test samples, and extends it with AdaMerging++ variants that handle sign conflicts and redundancy among task vectors. Across eight image-classification tasks with ViT backbones, AdaMerging substantially outperforms prior task-vector merging methods, generalizes better to unseen tasks, and is more robust to data distribution shifts. The study also shows that the learned coefficients are task- and layer-specific and that prediction entropy correlates strongly with prediction loss, which makes entropy minimization a practical, data-efficient surrogate objective for optimizing the coefficients.
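The difference between a single shared coefficient, one coefficient per task vector, and one coefficient per layer of each task vector can be illustrated with a minimal sketch. It assumes models are represented as plain lists of flattened layers; the helper names (`task_vector`, `merge`, `task_arithmetic`, `task_wise`, `layer_wise`) and the toy weights are hypothetical and are not taken from the paper's code.

```python
from typing import List

Layer = List[float]   # one layer's weights, flattened (assumption of this sketch)
Model = List[Layer]   # a model is a list of layers


def task_vector(theta_k: Model, theta_pre: Model) -> Model:
    """T_k = theta_k - theta_pre, computed layer by layer."""
    return [[w - p for w, p in zip(lk, lp)] for lk, lp in zip(theta_k, theta_pre)]


def merge(theta_pre: Model, task_vectors: List[Model], coeffs: List[List[float]]) -> Model:
    """Return theta_pre + sum_k coeffs[k][l] * T_k[l] for every layer l."""
    merged = [list(layer) for layer in theta_pre]
    for t_k, lam_k in zip(task_vectors, coeffs):
        for l, (layer, lam) in enumerate(zip(t_k, lam_k)):
            for i, t in enumerate(layer):
                merged[l][i] += lam * t
    return merged


def task_arithmetic(theta_pre: Model, tvs: List[Model], lam: float) -> Model:
    """One shared coefficient for every task vector and every layer."""
    return merge(theta_pre, tvs, [[lam] * len(theta_pre) for _ in tvs])


def task_wise(theta_pre: Model, tvs: List[Model], lams: List[float]) -> Model:
    """One coefficient per task vector, shared across its layers."""
    return merge(theta_pre, tvs, [[lam] * len(theta_pre) for lam in lams])


def layer_wise(theta_pre: Model, tvs: List[Model], lams: List[List[float]]) -> Model:
    """One coefficient per (task vector, layer) pair: lams[k][l]."""
    return merge(theta_pre, tvs, lams)


# Toy two-layer models for tasks A and B (illustrative values only).
theta_pre = [[1.0, 1.0], [0.0]]
theta_a = [[2.0, 1.0], [1.0]]
theta_b = [[1.0, 3.0], [-1.0]]
tvs = [task_vector(theta_a, theta_pre), task_vector(theta_b, theta_pre)]

shared = task_arithmetic(theta_pre, tvs, 0.5)
per_task = task_wise(theta_pre, tvs, [1.0, 0.5])
per_layer = layer_wise(theta_pre, tvs, [[1.0, 0.0], [0.5, 1.0]])
```

In this sketch the three schemes differ only in the shape of the coefficient table passed to `merge`, which is why the layer-wise variant has the most freedom to resolve conflicts between task vectors.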

Abstract

Multi-task learning (MTL) aims to empower a model to tackle multiple tasks simultaneously. A recent development known as task arithmetic has revealed that several models, each fine-tuned for distinct tasks, can be directly merged into a single model to execute MTL without necessitating a retraining process using the initial training data. Nevertheless, this direct addition of models often leads to a significant deterioration in the overall performance of the merged model. This decline occurs due to potential conflicts and intricate correlations among the multiple tasks. Consequently, the challenge emerges of how to merge pre-trained models more effectively without using their original training data. This paper introduces an innovative technique called Adaptive Model Merging (AdaMerging). This approach aims to autonomously learn the coefficients for model merging, either in a task-wise or layer-wise manner, without relying on the original training data. Specifically, our AdaMerging method operates as an automatic, unsupervised task arithmetic scheme. It leverages entropy minimization on unlabeled test samples from the multi-task setup as a surrogate objective function to iteratively refine the merging coefficients of the multiple models. Our experimental findings across eight tasks demonstrate the efficacy of the AdaMerging scheme we put forth. Compared to the current state-of-the-art task arithmetic merging scheme, AdaMerging showcases a remarkable 11% improvement in performance. Notably, AdaMerging also exhibits superior generalization capabilities when applied to unseen downstream tasks. Furthermore, it displays a significantly enhanced robustness to data distribution shifts that may occur during the testing phase.
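The idea of refining merging coefficients by minimizing entropy on unlabeled samples can be sketched on a toy problem. The example below is an illustration under strong assumptions, not the authors' implementation: the "model" is a tiny linear classifier, the gradient is approximated by central finite differences instead of backpropagation, and a step is accepted only if it lowers the objective. All names (`merged_weights`, `mean_entropy`, `learn_coefficients`), the toy weights, and the starting value of 0.3 for every coefficient are choices made for this sketch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def merged_weights(w_pre, task_vectors, lams):
    """W(lambda) = W_pre + sum_k lambda_k * T_k for a single weight matrix."""
    return [[w_pre[c][d] + sum(lam * tv[c][d] for lam, tv in zip(lams, task_vectors))
             for d in range(len(w_pre[0]))]
            for c in range(len(w_pre))]


def mean_entropy(lams, w_pre, task_vectors, xs):
    """Average prediction entropy of the merged linear classifier on unlabeled xs."""
    w = merged_weights(w_pre, task_vectors, lams)
    total = 0.0
    for x in xs:
        logits = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
        total += entropy(softmax(logits))
    return total / len(xs)


def learn_coefficients(w_pre, task_vectors, xs, init=0.3, lr=0.5, steps=50, eps=1e-4):
    """Descend on the mean entropy; finite differences stand in for backprop."""
    def obj(l):
        return mean_entropy(l, w_pre, task_vectors, xs)

    lams = [init] * len(task_vectors)
    best = obj(lams)
    for _ in range(steps):
        grad = []
        for k in range(len(lams)):
            up, down = lams[:], lams[:]
            up[k] += eps
            down[k] -= eps
            grad.append((obj(up) - obj(down)) / (2 * eps))
        cand = [l - lr * g for l, g in zip(lams, grad)]
        val = obj(cand)
        if val < best:      # accept only steps that lower the entropy
            lams, best = cand, val
        else:               # otherwise shrink the step size and retry
            lr *= 0.5
    return lams, best


# Toy setup: 2 classes, 2 input features, 2 task vectors, 4 unlabeled samples.
w_pre = [[0.1, 0.0], [0.0, 0.1]]
task_vectors = [[[1.0, 0.0], [-1.0, 0.0]], [[0.0, -1.0], [0.0, 1.0]]]
xs = [[1.0, 0.2], [0.1, 1.0], [-1.0, 0.3], [0.2, -1.0]]

initial = mean_entropy([0.3, 0.3], w_pre, task_vectors, xs)
lams, final = learn_coefficients(w_pre, task_vectors, xs)
```

No labels appear anywhere in the loop: the only training signal is how confident the merged model is on the unlabeled inputs, which is what makes the scheme unsupervised.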

Paper Structure (16 sections, 1 equation, 16 figures, 12 tables)

Introduction
Related Work
Methodology
Preliminaries
Adaptive Model Merging for Multi-Task Learning
AdaMerging: Adaptive Model Merging
Entropy Optimization
Experiment
Experiment Setup
Performance, Generalization, Robustness
AdaMerging Analysis
Conclusion and Future Work
Experiment Settings
Experiment Results
Performance, Generalization and Robustness
...and 1 more sections

Figures (16)

Figure 1: The impact of the coefficient $\lambda$ on the average accuracy of various MTL methods on eight tasks. Task Arithmetic [TaskArithmetic_ICLR2023] and Ties-Merging [TiesMerging_NeurIPS2023], both based on task vectors, achieve their best average accuracy at $\lambda=0.3$: $69.1\%$ and $72.9\%$, respectively. Traditional MTL and our AdaMerging reach $88.9\%$ and $80.1\%$, respectively.
Figure 2: (a) Definition of a "task vector": the task vector $T_k$ is obtained by subtracting the pre-trained weights $\theta_{pre}$ from the model weights $\theta_{k}$ fine-tuned on the data of task $k$. (b) Task Arithmetic [TaskArithmetic_ICLR2023] for MTL, which assigns the same merging coefficient $\lambda$ to every task vector $T_k$ ($k \in \{A,B\}$). (c) Task-wise AdaMerging for MTL, which learns a distinct merging coefficient $\lambda_k$ for each task vector $T_k$ ($k \in \{A,B\}$). (d) Layer-wise AdaMerging for MTL, which learns a distinct merging coefficient $\lambda_k^l$ for each layer $l$ ($l \in \{1,2\}$) of the task vector $T_k$ ($k \in \{A,B\}$).
Figure 3: Correlation between entropy $H(\hat{Y})$ and average loss $L(Y,\hat{Y})$ on eight tasks (or datasets). (a) We divide the test samples into eleven groups according to their entropy and observe the average prediction loss of the samples in each group; groups with smaller entropy have smaller average loss. (b) We compute the Spearman correlation coefficient between entropy and prediction loss on the eight tasks (or datasets) and observe a high positive correlation.
Figure 4: Learned model merging coefficients $\{\lambda^l_k\}_{k=1,l=1}^{K,L}$ of Layer-wise AdaMerging (above) and AdaMerging++ (below) on ViT-B/32. The $k$-th row represents the $k$-th task vector, the $l$-th column represents the $l$-th layer, and each cell shows the coefficient $\lambda^l_k$.
Figure 5: Example visualization of corrupted data; the corrupted images are generated following the method of [RobustnessBenchmarking2019].
...and 11 more figures
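The entropy-versus-loss analysis described in the Figure 3 caption can be sketched in a few lines: compute the prediction entropy and the cross-entropy loss of each sample, then measure their Spearman rank correlation. This is a minimal standard-library sketch; the function names and the toy logits and labels are hypothetical and do not reproduce the paper's data or results.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def entropy(probs):
    """Shannon entropy H(Y_hat) of a predicted distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cross_entropy(probs, label):
    """Prediction loss L(Y, Y_hat) for a single labeled sample."""
    return -math.log(probs[label])


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


# Toy logits and labels (illustrative only, not data from the paper).
samples = [([4.0, 0.5, 0.1], 0), ([1.2, 1.0, 0.9], 2),
           ([0.2, 3.0, 0.3], 1), ([1.0, 1.1, 0.8], 0)]
probs = [softmax(logits) for logits, _ in samples]
entropies = [entropy(p) for p in probs]
losses = [cross_entropy(p, label) for p, (_, label) in zip(probs, samples)]
rho = spearman(entropies, losses)
```

Spearman correlation is a natural choice here because it only asks whether lower-entropy samples tend to have lower loss, without assuming the relationship is linear.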
