Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging

Li Shen, Anke Tang, Enneng Yang, Guibing Guo, Yong Luo, Lefei Zhang, Xiaochun Cao, Bo Du, Dacheng Tao

TL;DR

E-WEMoE, an efficient and effective variant of WEMoE, eliminates non-essential elements in the critical modules of WEMoE and shares a single router across multiple MoE modules, significantly reducing the trainable parameters, the overall parameter count, and the computational overhead of the WEMoE-merged model.

Abstract

Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer. Recent research on task arithmetic-based MTL demonstrates that merging the parameters of independently fine-tuned models can effectively achieve MTL. However, existing merging methods primarily seek a static optimal solution within the original model parameter space, which often results in performance degradation due to the inherent diversity among tasks and potential interference between them. To address this challenge, we propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging. Specifically, we first identify critical (or sensitive) modules by analyzing parameter variations in core modules of Transformer-based models before and after fine-tuning. WEMoE then statically merges non-critical modules while transforming critical modules into a mixture-of-experts (MoE) structure. During inference, expert modules in the MoE are dynamically merged based on input samples, enabling a more flexible and adaptive merging approach. Building on WEMoE, we further introduce an efficient-and-effective WEMoE (E-WEMoE) method, which eliminates non-essential elements in the critical modules of WEMoE and shares routing across multiple MoE modules, thereby significantly reducing the trainable parameters, the overall parameter count, and the computational overhead of the WEMoE-merged model. Experimental results across various architectures and tasks demonstrate that both WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
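
The input-dependent merging described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear router, the function names `route` and `dynamic_merge`, the toy shapes, and the bias value 0.3 are all assumptions made for illustration. The sketch only shows the general idea that a router produces one merging weight per task from the input, and that those weights combine task-specific weight differences on top of the pre-trained weights.

```python
import numpy as np


def route(features, router_w, router_b):
    """Hypothetical linear router: one merging weight per task, computed from the input."""
    return router_w @ features + router_b


def dynamic_merge(pretrained_w, task_vectors, gate):
    """Return pretrained weights plus the gate-weighted sum of task vectors.

    task_vectors has shape (num_tasks, d_out, d_in); gate has shape (num_tasks,).
    """
    return pretrained_w + np.tensordot(gate, task_vectors, axes=1)


# Toy setup (all sizes and values are illustrative assumptions).
rng = np.random.default_rng(0)
num_tasks, d_in, d_out = 3, 4, 5
pretrained_w = rng.normal(size=(d_out, d_in))
finetuned_w = pretrained_w + 0.1 * rng.normal(size=(num_tasks, d_out, d_in))
task_vectors = finetuned_w - pretrained_w  # one weight difference per task

router_w = rng.normal(size=(num_tasks, d_in))
router_b = np.full(num_tasks, 0.3)

x = rng.normal(size=d_in)                  # features of one input sample
gate = route(x, router_w, router_b)        # input-dependent merging weights
merged_w = dynamic_merge(pretrained_w, task_vectors, gate)
y = merged_w @ x                           # forward pass through the merged layer
```

With a one-hot gate the merged layer recovers the corresponding fine-tuned weights, and with a constant gate it reduces to a static weighted merge, which is the case the dynamic scheme is meant to generalize.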

Paper Structure

This paper contains 34 sections, 14 equations, 16 figures, 12 tables.

Figures (16)

  • Figure 1: Comparison of the relationship between parameter count and performance across various model merging methods.
  • Figure 2: An illustration of the loss landscapes of $s_1$, $s_2$, and $s_1 \cup s_2$. There is no static solution $\theta'$ that simultaneously minimizes the loss of both tasks better than $\mathop{\mathrm{arg\,min}}\limits_\theta \mathcal{L}_1(\theta) + \mathcal{L}_2(\theta)$.
  • Figure 3: (a) Overview of the Weight-Ensembling Mixture of Experts (WEMoE) Framework. This figure illustrates the overall framework of our proposed approach for merging the pre-trained model with fine-tuned task-specific models. We perform weight merging across the Transformer layers, excluding the MLPs. For the MLPs, we upcycle them into weight-ensembling MoE modules. (b) WEMoE module. This diagram details the structure of the WEMoE module, which consists of a router, the pre-trained MLP weights, and a set of task vectors w.r.t. MLP modules.
  • Figure 4: The distance between the parameters of the pre-trained model and the fine-tuned models. The first sub-figure shows the average $L_2$ distance for CLIP-ViT-B/32 on eight datasets, and the last sub-figure shows it for CLIP-ViT-B/16.
  • Figure 5: (a) Overview of the Efficient Weight-Ensembling Mixture of Experts (E-WEMoE) Framework. It merges all non-MLP modules through task arithmetic and upgrades the MLP modules into an efficient E-WEMoE module. (b) E-WEMoE Module. The module includes a router shared across all Transformer blocks, the pre-trained MLP module, and a set of sparse task-specific vectors w.r.t. MLP modules.
  • ...and 11 more figures

Theorems & Definitions (3)

  • Definition 2.1: Model Merging
  • Definition 2.2: Task Vector (Ilharco et al., 2023)
  • Definition 2.3: Task Vector based Model Merging
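
The definitions themselves are not reproduced on this page. As a hedged reminder, the standard task-arithmetic formulation that Definitions 2.2 and 2.3 refer to is usually written as below. The symbols $\theta_0$ (pre-trained weights), $\theta_i$ (weights fine-tuned on task $i$), $\tau_i$ (task vector), $\lambda$ (scaling coefficient), and $n$ (number of tasks) are assumed notation and may differ from the paper's.

```latex
% Task vector for task i: fine-tuned weights minus pre-trained weights
\tau_i = \theta_i - \theta_0, \qquad i = 1, \dots, n

% Task-vector-based merging (task arithmetic): add the scaled sum of task vectors
\theta_{\mathrm{merged}} = \theta_0 + \lambda \sum_{i=1}^{n} \tau_i
```

Here $\lambda$ is a single static coefficient shared by all inputs, which is the kind of static solution the abstract contrasts with input-dependent merging.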