Fine-Tuning Attention Modules Only: Enhancing Weight Disentanglement in Task Arithmetic

Ruochen Jin; Bojian Hou; Jiancong Xiao; Weijie Su; Li Shen

TL;DR

This work tackles interference in task arithmetic by showing that fine-tuning only the attention modules of transformers yields strong weight disentanglement while preserving high individual-task accuracy. By revealing kernel-like behavior in attention and separating representation from task-specific heads, the method achieves superior unified-model performance with substantially lower training cost than NTK-based linearization. The representation module is identified as the main source of weight disentanglement, whereas task-specific heads can limit it, leading to practical design guidance. Empirically, the approach delivers up to 2.38% improvement in average unified accuracy on vision-language benchmarks and demonstrates robust performance across a range of mixing coefficients $\alpha$ with improved efficiency over prior baselines.

Abstract

In recent years, task arithmetic has garnered increasing attention. This approach edits pre-trained models directly in weight space by combining the fine-tuned weights of various tasks into a unified model. Its efficiency and cost-effectiveness stem from its training-free combination, in contrast to traditional methods that require training a model on large datasets for multiple tasks. However, applying such a unified model to individual tasks can lead to interference from other tasks (a lack of weight disentanglement). To address this issue, Neural Tangent Kernel (NTK) linearization has been employed to leverage "kernel behavior", facilitating weight disentanglement and mitigating adverse effects from unrelated tasks. Despite its benefits, NTK linearization has drawbacks, including doubled training costs and reduced performance of the individual models. To tackle this problem, we propose a simple yet effective and efficient method: fine-tuning only the attention modules of the Transformer. Our study reveals that the attention modules exhibit kernel behavior, and that fine-tuning only the attention modules significantly improves weight disentanglement. To further understand how our method improves the weight disentanglement of task arithmetic, we present a comprehensive study of task arithmetic that differentiates the roles of the representation module and the task-specific modules. In particular, we find that the representation module plays an important role in improving weight disentanglement, whereas task-specific modules such as classification heads can degrade weight disentanglement. (The code is available at https://github.com/kyrie-23/task_arithmetic_tangent)
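The weight-space combination described in the abstract can be illustrated with a minimal sketch. It assumes a toy representation in which a model is a dictionary from parameter names to flat lists of floats; the names `Weights`, `task_vector`, and `unify` are hypothetical and are not taken from the authors' code.

```python
from typing import Dict, List

# Hypothetical toy representation: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """Task vector: fine-tuned weights minus pre-trained weights, per parameter."""
    return {
        name: [ft - pt for ft, pt in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def unify(pretrained: Weights, task_vectors: List[Weights], alphas: List[float]) -> Weights:
    """Unified model: pre-trained weights plus the alpha-scaled sum of task vectors."""
    unified = {name: list(values) for name, values in pretrained.items()}
    for alpha, tau in zip(alphas, task_vectors):
        for name, delta in tau.items():
            unified[name] = [w + alpha * d for w, d in zip(unified[name], delta)]
    return unified


# Toy example with two tasks that each change a different coordinate.
theta_0 = {"w": [1.0, 2.0]}
theta_a = {"w": [1.5, 2.0]}  # hypothetical weights after fine-tuning on task A
theta_b = {"w": [1.0, 3.0]}  # hypothetical weights after fine-tuning on task B

tau_a = task_vector(theta_0, theta_a)
tau_b = task_vector(theta_0, theta_b)
unified = unify(theta_0, [tau_a, tau_b], alphas=[1.0, 1.0])
```

No training happens in `unify`; the combination is pure arithmetic on stored weights, which is why the abstract calls it training-free. The toy example says nothing about interference, which depends on how the model's predictions respond to the combined weights.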

Paper Structure (23 sections, 6 equations, 10 figures, 5 tables)

Introduction
Preliminaries: Task Arithmetic and Weight Disentanglement
Non-linear Fine-Tuning.
Accuracy Gap.
NTK Linearization Fine-tuning.
Task Arithmetic in Attention Modules
Main Challenge of Task Arithmetic
Accuracy Gap: Kernel Behavior and Weight Disentanglement of Attention Module
Kernel Behavior Test.
Accuracy of Individual Models with Fine-Tuning Attention Modules
Accuracy of Unified Models with Fine-tuning Attention Modules
Robustness of Task Arithmetic with Respect to Coefficient $\alpha$
Weight Disentanglement Emerges From Representation Module
Weight Disentanglement Results
Related Work
...and 8 more sections

Figures (10)

Figure 1: Illustration of task arithmetic and weight disentanglement. Left: in task arithmetic, we first fine-tune the pre-trained model $\theta_0$ to obtain the fine-tuned individual model $\theta_0+{\alpha_t}\tau_t$, where $\tau_t$ is the $t$-th task vector. We then obtain the unified model by adding all the task vectors to the pre-trained model: $\theta_0+\sum_{t=1}^T{\alpha_t}\tau_t$. Right: weight disentanglement means that the prediction of the unified model on a specific task is not affected by the other tasks.
Figure 2: Logic flow of our work.
Figure 3: Accuracy of non-linear and post-hoc models by task. The diagonal dashed line marks where post-hoc performance equals non-linear performance.
Figure 4: Three types of fine-tuning paradigms. (a) Non-linear fine-tuning, where all parameters are updated. (b) Full-model linearization. (c) Attention-modules-only fine-tuning, where only $W_q$, $W_k$, $W_v$ and $W_o$ are updated. In this paper, we explore attention-modules-only fine-tuning.
Figure 5: Averaged accuracy of non-linear and linear models. The diagonal dashed line marks where linear fine-tuning performance equals non-linear performance.
...and 5 more figures
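
The attention-modules-only paradigm in Figure 4(c) amounts to restricting updates to $W_q$, $W_k$, $W_v$ and $W_o$. The sketch below shows one way such a restriction could look on a toy model. The dotted parameter-naming scheme, the plain gradient step, and the helper names are assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]

# Matrix names come from the Figure 4(c) caption; the dotted naming scheme
# used in the example below (e.g. "block0.attn.W_q") is hypothetical.
ATTENTION_KEYS = ("W_q", "W_k", "W_v", "W_o")


def is_attention_param(name: str) -> bool:
    """True if the last component of the dotted name is an attention matrix."""
    return name.split(".")[-1] in ATTENTION_KEYS


def step_attention_only(weights: Weights, grads: Weights, lr: float) -> Weights:
    """One plain gradient step that updates attention parameters and freezes the rest."""
    return {
        name: [w - lr * g for w, g in zip(values, grads[name])]
        if is_attention_param(name)
        else list(values)
        for name, values in weights.items()
    }


weights = {"block0.attn.W_q": [1.0], "block0.mlp.W_in": [1.0]}
grads = {"block0.attn.W_q": [2.0], "block0.mlp.W_in": [2.0]}
updated = step_attention_only(weights, grads, lr=0.5)
```

Under this restriction, every non-attention parameter keeps its pre-trained value, so the resulting task vector is zero outside the attention matrices.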

Theorems & Definitions (2)

Definition 1: Task Vector and Task Arithmetic
Definition 2: Weight disentanglement
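
The two definitions are listed here by title only. As a hedged reading based on the Figure 1 caption, they can be written out as follows. The symbols $\theta_t^{*}$ (weights after fine-tuning on task $t$), $f(x;\theta)$ (the model's prediction on input $x$) and $\mathcal{D}_t$ (the inputs of task $t$) are notational assumptions introduced here, and the paper's formal statements may differ.

$$\tau_t = \theta_t^{*} - \theta_0, \qquad \theta_{\text{unified}} = \theta_0 + \sum_{t=1}^{T} \alpha_t \tau_t$$

$$f\Big(x;\ \theta_0 + \sum_{t'=1}^{T} \alpha_{t'} \tau_{t'}\Big) = f\big(x;\ \theta_0 + \alpha_t \tau_t\big) \quad \text{for } x \in \mathcal{D}_t$$

The second line expresses the caption's statement that the unified model's prediction on a given task is unaffected by the task vectors of the other tasks.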
