MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic

Yuyan Zhou, Liang Song, Bingning Wang, Weipeng Chen

TL;DR

This paper proposes Model Exclusive Task Arithmetic for merging GPT-scale models (MetaGPT), which formalizes the objective of model merging as a multi-task learning problem that minimizes the average loss difference between the merged model and each individual task model.

Abstract

The advent of large language models (LLMs) like GPT-4 has catalyzed the exploration of multi-task learning (MTL), in which a single model demonstrates proficiency across diverse tasks. Task arithmetic has emerged as a cost-effective approach for MTL. It enables performance enhancement across multiple tasks by adding their corresponding task vectors to a pre-trained model. However, the current lack of a method that can simultaneously achieve optimal performance, computational efficiency, and data privacy limits its application to LLMs. In this paper, we propose Model Exclusive Task Arithmetic for merging GPT-scale models (MetaGPT), which formalizes the objective of model merging into a multi-task learning framework, aiming to minimize the average loss difference between the merged model and each individual task model. Since data privacy limits the use of multi-task training data, we leverage LLMs' local linearity and task vectors' orthogonality to separate the data term from the scaling-coefficient term and derive a model-exclusive task arithmetic method. Our proposed MetaGPT is data-agnostic and bypasses the heavy search process, making it cost-effective and easy to implement for LLMs. Extensive experiments demonstrate that MetaGPT improves on task arithmetic and achieves state-of-the-art performance on multiple tasks.
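
The task-arithmetic step the abstract describes (adding task vectors to a pre-trained model) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Params` layout, the function names `task_vector` and `merge`, and the scaling coefficients `[1.0, 0.5]` are all assumptions made for the example, and the coefficients are arbitrary placeholders rather than the values MetaGPT derives.

```python
from typing import Dict, List

# Hypothetical flat parameter store for illustration: name -> list of weights.
Params = Dict[str, List[float]]


def task_vector(pretrained: Params, finetuned: Params) -> Params:
    """Task vector: fine-tuned weights minus pre-trained weights."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def merge(pretrained: Params, task_vectors: List[Params], coeffs: List[float]) -> Params:
    """Add each task vector, scaled by its coefficient, to the pre-trained weights."""
    if len(task_vectors) != len(coeffs):
        raise ValueError("need one scaling coefficient per task vector")
    merged = {name: list(values) for name, values in pretrained.items()}
    for tau, lam in zip(task_vectors, coeffs):
        for name, delta in tau.items():
            merged[name] = [m + lam * d for m, d in zip(merged[name], delta)]
    return merged


# Toy two-parameter "models"; real checkpoints would hold tensors per layer.
theta_0 = {"w": [1.0, 2.0]}   # pre-trained
theta_a = {"w": [1.5, 2.0]}   # fine-tuned on task A
theta_b = {"w": [1.0, 3.0]}   # fine-tuned on task B

taus = [task_vector(theta_0, theta_a), task_vector(theta_0, theta_b)]
merged = merge(theta_0, taus, coeffs=[1.0, 0.5])  # placeholder coefficients
print(merged)
```

The only free choice in this scheme is the list of scaling coefficients, which is the quantity the paper sets out to determine without access to task data or a search.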

Paper Structure

This paper contains 37 sections, 6 theorems, 36 equations, 4 figures, and 9 tables.

Key Result

Lemma 4

Using a Taylor expansion of $\mathcal{L}(\bm{\theta}_{\textup{final}}, \bm{x})$ at $\bm{\theta}_t$, the $\textup{TLD}_t$ of Definition 1 can be reformulated as a quadratic form with respect to the linear combination of $\bm{\lambda}$ and $\bm{\theta}$, where $\gamma_t(\beta) = \bm{\theta}_t+\beta(\bm{\theta}_{\textup{final}} - \bm{\theta}_t)$ and $\bm{h}_t$ is the linear combination of $\bm{\lambda}$
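
For orientation, the standard second-order Taylor expansion with integral remainder has the shape below. This is the generic textbook identity (valid when $\mathcal{L}$ is twice continuously differentiable in the parameters) written with the lemma's path $\gamma_t(\beta)$; the shorthand $\bm{\delta}_t$ is introduced here for the example, and the block is not the paper's own statement of the lemma, which is phrased in terms of $\bm{h}_t$.

```latex
\mathcal{L}(\bm{\theta}_{\textup{final}}, \bm{x}) - \mathcal{L}(\bm{\theta}_t, \bm{x})
  = \nabla \mathcal{L}(\bm{\theta}_t, \bm{x})^{\top} \bm{\delta}_t
  + \bm{\delta}_t^{\top}
    \left( \int_0^1 (1-\beta)\, \nabla^2 \mathcal{L}(\gamma_t(\beta), \bm{x}) \, d\beta \right)
    \bm{\delta}_t,
\qquad
\bm{\delta}_t = \bm{\theta}_{\textup{final}} - \bm{\theta}_t,
\quad
\gamma_t(\beta) = \bm{\theta}_t + \beta\,\bm{\delta}_t .
```

If the gradient at $\bm{\theta}_t$ is close to zero, for instance because $\bm{\theta}_t$ is near a minimizer of the task loss (an assumption of this sketch), only the second term survives, which is why the loss difference can be read as a quadratic form in the parameter displacement.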

Figures (4)

  • Figure 1: Existing methods face a trilemma of performance, data privacy, and computational cost, which hinders their application to LLMs. Our MetaGPT resolves these problems under careful approximation and can thus scale to GPT-3-scale LLMs.
  • Figure 2: Current task-arithmetic-based methods suffer from sub-optimal performance, huge computational and memory cost, the curse of dimensionality, and data privacy constraints, which make them difficult to scale to LLMs. Our method solves the aforementioned problems and provides an avenue to scale task arithmetic to LLMs.
  • Figure 3: Verification of NTK linearization. We randomly sampled the outputs of Llama-2-7b-chat-hf with different $\alpha$. The sampled outputs vary linearly with $\alpha$, as expected.
  • Figure 4: Verification of orthogonality. We calculate the cosine similarity between six different task vectors and find that their cosine similarity is nearly 0.
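
The orthogonality check described for Figure 4 amounts to computing pairwise cosine similarities between flattened task vectors. A minimal sketch, assuming each task vector has already been flattened into a plain list of floats; the task names and the toy three-dimensional vectors are made up for illustration and are not the paper's data.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def pairwise_cosine(task_vectors: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Cosine similarity for every unordered pair of task vectors."""
    return {
        (a, b): cosine_similarity(task_vectors[a], task_vectors[b])
        for a, b in combinations(sorted(task_vectors), 2)
    }


# Hypothetical flattened task vectors (fine-tuned minus pre-trained weights).
toy = {
    "math": [1.0, 0.0, 0.0],
    "code": [0.0, 2.0, 0.0],
    "chat": [3.0, 0.0, 4.0],
}
sims = pairwise_cosine(toy)
for pair, value in sims.items():
    print(pair, round(value, 3))
```

Values near 0 for every pair would correspond to the near-orthogonality the figure reports; in this toy input the "chat" and "math" vectors deliberately overlap, so that pair is not orthogonal.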

Theorems & Definitions (9)

  • Definition 1: Single Task Loss Difference
  • Definition 2: Average Task Loss Difference
  • Definition 3: Optimization objective of MetaGPT
  • Lemma 4
  • Theorem 7
  • Theorem 8
  • Theorem 9: $\lambda$ decomposition of ALD
  • Theorem 10: Optimal Scaling Coefficients
  • Lemma 11