Revisiting Weight Averaging for Model Merging

Jiho Choi, Donggyun Kim, Chanhyuk Lee, Seunghoon Hong

TL;DR

This work revisits weight averaging for model merging by reframing it as task arithmetic centered at the weight average, and shows that centering combined with a low-rank approximation of the task vectors substantially reduces inter-task interference. The authors introduce CART (Centered Arithmetic with Rank-reduced Task vectors), a training-free merging method that keeps only the top-k singular vectors of the centered task vectors (each fine-tuned model's weights minus the weight average), and report robust gains across vision and NLP benchmarks and across backbone sizes. They provide theoretical and empirical evidence linking reduced row-space interference to improved merging, and find that a rank around 8% of the full rank consistently yields strong performance. CART can be integrated with existing task-arithmetic approaches and extended to test-time adaptation or model compression, offering a practical, scalable solution for multi-task merging without additional training data.
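
The reframing in the summary can be written out as a short derivation. This is a sketch in assumed notation, not the paper's own equations: $\theta_t$ denotes the weights fine-tuned on task $t$, $\bar{\tau}_t$ a centered task vector, $\lambda$ a scaling coefficient, and $\mathrm{SVD}_k(\cdot)$ the truncation to the top-$k$ singular components. Because the centered task vectors sum to zero, plain weight averaging equals task arithmetic around the average for any $\lambda$; a merge can differ from the average only once each vector is rank-reduced.

```latex
\begin{align}
\theta_{\mathrm{avg}} &= \frac{1}{T}\sum_{t=1}^{T}\theta_t,
\qquad
\bar{\tau}_t := \theta_t - \theta_{\mathrm{avg}}, \\
\sum_{t=1}^{T}\bar{\tau}_t &= \sum_{t=1}^{T}\theta_t - T\,\theta_{\mathrm{avg}} = 0
\quad\Longrightarrow\quad
\theta_{\mathrm{avg}} = \theta_{\mathrm{avg}} + \lambda\sum_{t=1}^{T}\bar{\tau}_t
\quad\text{for any } \lambda, \\
\theta_{\mathrm{merged}} &= \theta_{\mathrm{avg}} + \lambda\sum_{t=1}^{T}\mathrm{SVD}_k\!\left(\bar{\tau}_t\right)
\qquad\text{(rank-reduced, generally } \neq \theta_{\mathrm{avg}}\text{)}.
\end{align}
```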

Abstract

Model merging aims to build a multi-task learner by combining the parameters of individually fine-tuned models without additional training. While a straightforward approach is to average model parameters across tasks, this often results in suboptimal performance due to interference among parameters across tasks. In this paper, we present intriguing results showing that weight averaging implicitly induces task vectors centered around the weight average itself, and that applying a low-rank approximation to these centered task vectors significantly improves merging performance. Our analysis shows that centering the task vectors effectively reduces task interference and that most of the task-specific knowledge is concentrated in the top singular vectors. Our method demonstrates robust and scalable performance on vision benchmarks across varying numbers of tasks and model sizes. Furthermore, we observe that our approach is applicable to natural language processing tasks with competitive performance.
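
The procedure described in the abstract (center the fine-tuned weights at their average, low-rank approximate each centered task vector, add the results back) can be sketched for a single weight matrix as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `low_rank` and `cart_merge`, the `scale` coefficient, and the way `rank_ratio` is converted to an integer rank are all hypothetical, and the default of 0.08 only mirrors the "around 8%" figure quoted in the summary.

```python
import numpy as np


def low_rank(matrix, k):
    """Best rank-k approximation of a 2-D matrix via truncated SVD."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep only the top-k singular triplets.
    return (u[:, :k] * s[:k]) @ vt[:k, :]


def cart_merge(finetuned, rank_ratio=0.08, scale=1.0):
    """Merge same-shaped 2-D weight matrices, one per task.

    Assumed (hypothetical) parameters:
      rank_ratio: fraction of the full rank kept per centered task vector.
      scale:      coefficient applied to the summed low-rank task vectors.
    """
    center = np.mean(np.stack(finetuned), axis=0)  # the weight average
    full_rank = min(center.shape)
    k = max(1, int(round(rank_ratio * full_rank)))

    merged = center.copy()
    for weights in finetuned:
        centered_task_vector = weights - center
        merged += scale * low_rank(centered_task_vector, k)
    return merged


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for one layer's weights from four fine-tuned models.
    task_weights = [rng.normal(size=(16, 12)) for _ in range(4)]
    merged = cart_merge(task_weights)
    print(merged.shape)
```

With `rank_ratio=1.0` the centered task vectors are kept at full rank and sum to zero, so the sketch reduces to plain weight averaging; a lower rank is what lets the merge move away from the average. A real model would apply this per weight matrix, and how non-matrix parameters such as biases are handled is left open here.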

Paper Structure

This paper contains 40 sections, 2 theorems, 22 equations, 13 figures, 4 tables, 2 algorithms.

Key Result

Theorem 1

Assume a multi-task model defined as $\theta_{\mathrm{MTL}} = \theta_0 + \sum_{t=1}^T \tau_t$ and inputs $x_{t,i}$ to each task $t$ lie close to the subspace spanned by row space of $\tau_t$. If we define Task Interference $L := \sum_{t=1}^T \sum_{i=1}^{n} \| \theta_{\mathrm{MTL}} x_{t,i} - (\theta_

Figures (13)

  • Figure 1: Average performance of model merging with low-rank approximations of task vectors. $\dagger$ denotes the method with test-time adaptation.
  • Figure 2: For the ViT-B/32 model with 8 vision tasks, we compute the row space interference $I(k)$ for both centered and original task vectors. The plots show the results for a single layer: the centered task vectors consistently exhibit lower $I(k)$ values than the original task vectors across all ranks $k$. Layer-wise means and detailed distribution plots for each layer are provided in Appendix \ref{sec:anal I}.
  • Figure 2: Multi-Task Performance on eight NLP Tasks with Merged RoBERTa-base with CART.
  • Figure 3: For the ViT-B/32 model with 8 vision tasks, the reconstruction error $R(k)$ and row space interference $I(k)$ are computed for the centered task vectors. In the results for a single layer, $R(k)$ declines sharply in the low-rank regime as the rank $k$ increases, while $I(k)$ rises gradually.
  • Figure 4: Radar plot illustrating performance on sets of 8, 14, and 20 vision classification tasks for ViT-B/32, with individual task accuracies normalized by their respective single-task performance. Corresponding results for ViT-L/14 are presented separately in Appendix \ref{sec:large_model}.
  • ...and 8 more figures

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • proof