Parameter Efficient Multi-task Model Fusion with Partial Linearization

Anke Tang, Li Shen, Yong Luo, Yibing Zhan, Han Hu, Bo Du, Yixin Chen, Dacheng Tao

TL;DR

This work proposes a method to improve multi-task fusion for parameter-efficient fine-tuning techniques such as LoRA. The method partially linearizes only the adapter modules and applies task arithmetic over the linearized adapters.

Abstract

Large pre-trained models have enabled significant advances in machine learning and serve as foundation components. Model fusion methods, such as task arithmetic, have proven to be a powerful and scalable way to incorporate fine-tuned weights from different tasks into a multi-task model. However, efficiently fine-tuning large pre-trained models on multiple downstream tasks remains challenging, leading to inefficient multi-task model fusion. In this work, we propose a novel method to improve multi-task fusion for parameter-efficient fine-tuning techniques like LoRA fine-tuning. Specifically, our approach partially linearizes only the adapter modules and applies task arithmetic over the linearized adapters. This allows us to leverage the advantages of model fusion over linearized fine-tuning, while still performing fine-tuning and inference efficiently. We demonstrate that our partial linearization technique enables a more effective fusion of multiple tasks into a single model, outperforming standard adapter tuning and task arithmetic alone. Experimental results demonstrate the capabilities of our proposed partial linearization technique to effectively construct unified multi-task models via the fusion of fine-tuned task vectors. We evaluate performance over an increasing number of tasks and find that our approach outperforms standard parameter-efficient fine-tuning techniques. The results highlight the benefits of partial linearization for scalable and efficient multi-task model fusion. The code is available at https://github.com/tanganke/peta
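
To make the idea of partial linearization concrete, here is a minimal NumPy sketch. It is not the authors' implementation (the repository linked above is the reference for that). It assumes a single toy layer $y = \tanh((W_0 + BA)x)$ with a frozen weight $W_0$ and LoRA factors $A$, $B$, and replaces the layer by its first-order Taylor expansion in $(A, B)$ around initial values $(A_0, B_0)$. The function names, shapes, random offsets, and the nonzero `B0` are all illustrative assumptions; LoRA commonly initializes $B$ to zero, and a nonzero value is used here only so that both factors contribute to the first-order term.

```python
import numpy as np


def lora_forward(W0, A, B, x):
    """Toy LoRA layer: frozen weight W0 plus the low-rank update B @ A."""
    return np.tanh((W0 + B @ A) @ x)


def linearized_lora_forward(W0, A0, B0, A, B, x):
    """First-order Taylor expansion of lora_forward in (A, B) around (A0, B0).

    Only the adapter factors are linearized; W0 stays frozen throughout.
    """
    z0 = (W0 + B0 @ A0) @ x            # pre-activation at the expansion point
    dA, dB = A - A0, B - B0            # adapter offsets from the expansion point
    dz = (dB @ A0 + B0 @ dA) @ x       # first-order change of the pre-activation
    return np.tanh(z0) + (1.0 - np.tanh(z0) ** 2) * dz


rng = np.random.default_rng(0)
d, r = 6, 2                            # hypothetical layer width and LoRA rank
W0 = rng.normal(size=(d, d))
A0 = rng.normal(size=(r, d))
B0 = 0.1 * rng.normal(size=(d, r))     # assumption: nonzero, see lead-in
x = rng.normal(size=d)

# Two hypothetical task vectors in adapter space (offsets from A0, B0).
tA1, tB1 = 0.05 * rng.normal(size=(r, d)), 0.05 * rng.normal(size=(d, r))
tA2, tB2 = 0.05 * rng.normal(size=(r, d)), 0.05 * rng.normal(size=(d, r))

base = lora_forward(W0, A0, B0, x)

# Output change caused by each task vector alone, and by their sum.
delta1 = linearized_lora_forward(W0, A0, B0, A0 + tA1, B0 + tB1, x) - base
delta2 = linearized_lora_forward(W0, A0, B0, A0 + tA2, B0 + tB2, x) - base
merged = linearized_lora_forward(
    W0, A0, B0, A0 + tA1 + tA2, B0 + tB1 + tB2, x
) - base

# In the linearized model, the effects of task vectors add up exactly
# (up to floating-point rounding).
additivity_gap = np.max(np.abs(merged - (delta1 + delta2)))
print("additivity gap (linearized):", additivity_gap)
```

Because the linearized layer is affine in the adapter offsets, adding two task vectors changes the output by the sum of their individual changes. The non-linear `lora_forward` carries no such guarantee, which is one way to see why task arithmetic is a natural fit for linearized adapters.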

Paper Structure (35 sections, 12 equations, 11 figures, 8 tables)

Figures (11)

  • Figure 1: Loss landscape visualization. We visualize the joint loss $\mathcal{L}(\tau_1; \theta) + \mathcal{L}(\tau_2; \theta)$ for a CLIP model on combinations of three downstream image classification tasks by interpolating on a 2D plane: $\theta = \theta_0 + \sum_{i=1}^2 \lambda_i (\theta_i - \theta_0)$, where $\theta_0$ are the pre-trained weights and $\theta_i$ are the task-specific fully fine-tuned weights for task $\tau_i$. The heatmaps show that the task-specific models reside in the same loss basin when evaluated on the joint task.
  • Figure 2: Similarity heatmaps. These figures show heatmaps of the cosine similarity between task vectors from task-specific CLIP models (Radford et al., 2021) fine-tuned on different tasks. (a) Cosine similarity matrix of task vectors when using full fine-tuning of the entire model. (b) Cosine similarity of task vectors when using LoRA. (c) Cosine similarity of task vectors when using L-LoRA, our proposed partial linearization approach that linearizes PEFT modules and fine-tunes in tangent space.
  • Figure 3: Four types of fine-tuning paradigms. (a) Full parameter fine-tuning. (b) Full-model linearization. (c) Parameter-efficient fine-tuning. (d) Linearized parameter-efficient fine-tuning. In this paper, we explore LoRA fine-tuning and linearized LoRA (L-LoRA) fine-tuning.
  • Figure 4: Pairs of model fusion. These figures show scatter plots demonstrating the performance of different model fusion techniques on pairs of tasks. Each plot corresponds to a different fusion method. The x and y axes in each plot denote the normalized scores on the two tasks. Points indicate the performance of specific instances. Dashed lines represent the average performance per task for each method. (a) Image classification tasks. (b) NLP tasks.
  • Figure 5: Multi-task model fusion. We construct multi-task models from task vectors specific to individual tasks, using various model fusion algorithms. The x-axis represents the number of task vectors used to build the multi-task model, and the y-axis represents the average normalized score of the resulting models across all seven downstream tasks. Each line shows the average normalized score of all multi-task models built from a fixed number of tasks, and the shaded area corresponds to the standard deviation.
  • ...and 6 more figures
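
The quantities named in the captions of Figures 1 and 2 can be illustrated with a short NumPy sketch. It assumes each model is a flat parameter vector; the helper names (`task_vector`, `merge`, `cosine_similarity`) and the toy numbers are illustrative and not taken from the paper's code. `merge` follows the interpolation $\theta = \theta_0 + \sum_i \lambda_i (\theta_i - \theta_0)$ from Figure 1, and `cosine_similarity` computes one entry of a heatmap like those in Figure 2.

```python
import numpy as np


def task_vector(theta_i, theta_0):
    """Task vector: fine-tuned weights minus pre-trained weights."""
    return theta_i - theta_0


def merge(theta_0, task_vectors, lambdas):
    """Task arithmetic: theta_0 + sum_i lambda_i * tau_i."""
    merged = theta_0.copy()
    for lam, tau in zip(lambdas, task_vectors):
        merged = merged + lam * tau
    return merged


def cosine_similarity(u, v):
    """Cosine of the angle between two flattened task vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy weights (hypothetical): each task changes a different coordinate.
theta_0 = np.array([1.0, 0.0, 0.0, 2.0])   # pre-trained
theta_1 = np.array([2.0, 0.0, 0.0, 2.0])   # fine-tuned on task 1
theta_2 = np.array([1.0, 3.0, 0.0, 2.0])   # fine-tuned on task 2

tau_1 = task_vector(theta_1, theta_0)      # [1, 0, 0, 0]
tau_2 = task_vector(theta_2, theta_0)      # [0, 3, 0, 0]

merged = merge(theta_0, [tau_1, tau_2], lambdas=[1.0, 1.0])
print(merged)                              # [2. 3. 0. 2.]
print(cosine_similarity(tau_1, tau_2))     # 0.0
```

In this toy case the two task vectors are orthogonal, so the merged weights reproduce each task's change exactly in the coordinate that task touched. Sweeping `lambdas` over a 2D grid and evaluating a loss at each merged point would give a landscape of the kind described for Figure 1.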

Theorems & Definitions (1)

  • Remark 3.1