Model Merging by Uncertainty-Based Gradient Matching

Nico Daheim, Thomas Möllenhoff, Edoardo Maria Ponti, Iryna Gurevych, Mohammad Emtiyaz Khan

TL;DR

The inaccuracy of weighted-averaging is connected to mismatches in the gradients, and a new uncertainty-based scheme is proposed that improves performance by reducing the mismatch.

Abstract

Models trained on different datasets can be merged by a weighted-averaging of their parameters, but why does it work and when can it fail? Here, we connect the inaccuracy of weighted-averaging to mismatches in the gradients and propose a new uncertainty-based scheme to improve the performance by reducing the mismatch. The connection also reveals implicit assumptions in other schemes such as averaging, task arithmetic, and Fisher-weighted averaging. Our new method gives consistent improvements for large language models and vision transformers, both in terms of performance and robustness to hyperparameters. Code available here.
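
The three baseline schemes named in the abstract can be sketched on toy parameter vectors. This is a minimal illustration using the commonly cited forms of these schemes, not the paper's code: the function names, the diagonal Fisher estimates `fishers`, the `eps` stabilizer, and all numeric values are assumptions made for the example.

```python
import numpy as np

def weighted_average(thetas, alphas):
    """Weighted averaging: theta = sum_t alpha_t * theta_t."""
    return sum(a * th for a, th in zip(alphas, thetas))

def task_arithmetic(theta_pre, thetas, alphas):
    """Task arithmetic: add scaled task vectors (theta_t - theta_pre) to the pretrained model."""
    return theta_pre + sum(a * (th - theta_pre) for a, th in zip(alphas, thetas))

def fisher_weighted_average(thetas, fishers, alphas, eps=1e-12):
    """Fisher-weighted averaging with diagonal Fisher estimates (elementwise weights).

    eps is a hypothetical stabilizer to avoid division by zero.
    """
    num = sum(a * f * th for a, f, th in zip(alphas, fishers, thetas))
    den = sum(a * f for a, f in zip(alphas, fishers))
    return num / (den + eps)

# Toy two-parameter "models" (made-up values for illustration only).
theta_pre = np.array([0.0, 0.0])
theta_1 = np.array([1.0, 0.0])
theta_2 = np.array([0.0, 2.0])
fisher_1 = np.array([3.0, 1.0])  # task 1 is more certain about parameter 0
fisher_2 = np.array([1.0, 3.0])  # task 2 is more certain about parameter 1

avg = weighted_average([theta_1, theta_2], [0.5, 0.5])
ta = task_arithmetic(theta_pre, [theta_1, theta_2], [1.0, 1.0])
fa = fisher_weighted_average([theta_1, theta_2], [fisher_1, fisher_2], [1.0, 1.0])
print(avg, ta, fa)
```

In this toy setup the Fisher-weighted result moves each coordinate toward the model that is more certain about it, which is the kind of per-parameter weighting that uniform averaging cannot express.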

Paper Structure

This paper contains 29 sections, 2 theorems, 38 equations, 2 figures, 6 tables.

Key Result

Theorem 1

For linear regression models with loss $\bar{\ell}_t(\boldsymbol{\theta}) = \frac{1}{2} \| \mathbf{y}_t - \mathbf{X}_t \boldsymbol{\theta} \|^2$ where $\mathbf{y}_t$ is the output vector and …
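
The statement above is cut off, so the following is not a restatement of Theorem 1. It is only a small numerical sketch of two standard properties of this quadratic loss that relate to the paper's theme: the gradient difference between two parameter vectors equals the Hessian $\mathbf{X}_t^\top \mathbf{X}_t$ times their difference, and a Hessian-weighted combination of per-task least-squares solutions recovers the joint solution, whereas a plain average generally does not. The data, dimensions, seed, and variable names are all assumptions made for the example, and no pretrained model or regularizer is included.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

# Two synthetic regression tasks with different ground-truth weights.
X1, X2 = rng.normal(size=(20, d)), rng.normal(size=(30, d))
w = rng.normal(size=d)
y1 = X1 @ w + 0.1 * rng.normal(size=20)
y2 = X2 @ (w + 0.5) + 0.1 * rng.normal(size=30)

def grad(X, y, theta):
    """Gradient of 0.5 * ||y - X theta||^2 with respect to theta."""
    return X.T @ (X @ theta - y)

# Hessians of the two losses (constant for a quadratic loss).
H1, H2 = X1.T @ X1, X2.T @ X2

# Per-task and joint least-squares solutions.
theta1 = np.linalg.solve(H1, X1.T @ y1)
theta2 = np.linalg.solve(H2, X2.T @ y2)
theta_joint = np.linalg.solve(H1 + H2, X1.T @ y1 + X2.T @ y2)

# Gradient mismatch of task 1 between the joint and the individual solution.
mismatch = grad(X1, y1, theta_joint) - grad(X1, y1, theta1)
hessian_form = H1 @ (theta_joint - theta1)

# Two ways to merge the individual solutions.
theta_avg = 0.5 * (theta1 + theta2)
theta_hess = np.linalg.solve(H1 + H2, H1 @ theta1 + H2 @ theta2)

print("mismatch equals H1 @ (theta_joint - theta1):", np.allclose(mismatch, hessian_form))
print("error of plain average:   ", np.linalg.norm(theta_avg - theta_joint))
print("error of Hessian-weighted:", np.linalg.norm(theta_hess - theta_joint))
```

The Hessian-weighted merge is exact here because $\mathbf{H}_t \boldsymbol{\theta}_t = \mathbf{X}_t^\top \mathbf{y}_t$ for each least-squares solution, so summing these terms reproduces the normal equations of the joint problem.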

Figures (2)

  • Figure 1: The left panel illustrates our approach. We connect the error $\Delta$ of the merged model $\boldsymbol{\theta}_{\text{merged}}$ to the gradient mismatch over losses $\bar{\ell}_t$ and propose a new method that reduces the mismatch by using the Hessian $\mathbf{H}_t$ and error $\Delta_t$ of the individual models $\boldsymbol{\theta}_t$. The right panel shows an example of adding datasets to RoBERTa trained on IMDB. We clearly see that reducing mismatch reduces test error of task arithmetic ($\alpha_t=1$). We consider 5 datasets, indicated by a number on the markers.
  • Figure 2: Left: We merge models trained on $8$ image classification tasks with a pretrained ViT and vary $\alpha_t$. Our method performs similarly to task arithmetic (TA) for smaller $\alpha_t$ but significantly better for larger $\alpha_t$, improving over the best $\alpha_t$ for TA. Right: We add four sentiment analysis tasks to RoBERTa trained on IMDB. Our merging function dominates TA and requires no tuning of scaling factors. We plot the average over individual dataset accuracies.

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2