Model Merging by Uncertainty-Based Gradient Matching

Nico Daheim; Thomas Möllenhoff; Edoardo Maria Ponti; Iryna Gurevych; Mohammad Emtiyaz Khan

TL;DR

The inaccuracy of weighted averaging is connected to mismatches in the gradients, and a new uncertainty-based scheme is proposed that improves performance by reducing this mismatch.

Abstract

Models trained on different datasets can be merged by a weighted-averaging of their parameters, but why does it work and when can it fail? Here, we connect the inaccuracy of weighted-averaging to mismatches in the gradients and propose a new uncertainty-based scheme to improve the performance by reducing the mismatch. The connection also reveals implicit assumptions in other schemes such as averaging, task arithmetic, and Fisher-weighted averaging. Our new method gives consistent improvements for large language models and vision transformers, both in terms of performance and robustness to hyperparameters. Code available here.
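
The three existing schemes named in the abstract can be written down compactly. The following is a minimal sketch in plain Python, assuming each model is a flat list of parameters and that Fisher information is approximated by a diagonal (one value per parameter). The function names and toy numbers are illustrative assumptions, not taken from the paper's code, and the paper's new uncertainty-based method is not shown here.

```python
from typing import List, Sequence

Vector = List[float]


def weighted_average(models: Sequence[Vector], alphas: Sequence[float]) -> Vector:
    """Weighted averaging: theta_merged = sum_t alpha_t * theta_t."""
    dim = len(models[0])
    return [sum(a * m[i] for a, m in zip(alphas, models)) for i in range(dim)]


def task_arithmetic(pretrained: Vector, models: Sequence[Vector],
                    alphas: Sequence[float]) -> Vector:
    """Task arithmetic: theta_pre + sum_t alpha_t * (theta_t - theta_pre)."""
    return [
        p + sum(a * (m[i] - p) for a, m in zip(alphas, models))
        for i, p in enumerate(pretrained)
    ]


def fisher_weighted_average(models: Sequence[Vector], fishers: Sequence[Vector],
                            alphas: Sequence[float]) -> Vector:
    """Fisher-weighted averaging with a diagonal Fisher (assumed > 0):
    elementwise sum_t alpha_t F_t theta_t / sum_t alpha_t F_t."""
    merged = []
    for i in range(len(models[0])):
        num = sum(a * f[i] * m[i] for a, f, m in zip(alphas, fishers, models))
        den = sum(a * f[i] for a, f in zip(alphas, fishers))
        merged.append(num / den)
    return merged


# Toy inputs (hypothetical values chosen only for illustration).
pretrained = [0.0, 0.0]
models = [[1.0, 0.0], [0.0, 2.0]]
fishers = [[3.0, 1.0], [1.0, 1.0]]

print(weighted_average(models, [0.5, 0.5]))                 # [0.5, 1.0]
print(task_arithmetic(pretrained, models, [1.0, 1.0]))      # [1.0, 2.0]
print(fisher_weighted_average(models, fishers, [1.0, 1.0])) # [0.75, 1.0]
```

In the toy example, the first coordinate of the Fisher-weighted result is pulled toward the first model because that model has the larger Fisher value there, whereas plain averaging treats both models equally.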

Paper Structure (29 sections, 2 theorems, 38 equations, 2 figures, 6 tables)

Introduction
Model Merging by Parameter Averaging
Model Merging and Connections to Gradient Mismatches
Analyzing the Inaccuracy of Task Arithmetic on Large Language Models
A New Method to Reduce the Gradient Mismatch
Relationship to existing methods
A new method for data removal
Gradient Mismatch Reduction as Uncertainty Estimation
Experiments & Results
Gradient Mismatch & Test Performance
Adding Tasks to Pretrained Vision Transformers
Sentiment Classification in NLP
Editing Language Generation Models By Removing Data
Conclusion
Derivations
...and 14 more sections

Key Result

Theorem 1

For linear regression models with loss $\bar{\ell}_t(\boldsymbol{\theta}) = \frac{1}{2} \| \mathbf{y}_t - \mathbf{X}_t \boldsymbol{\theta} \|^2$ where $\mathbf{y}_t$ is the output vector and ...
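
The theorem statement is cut off here, so the following is not a restatement of it. It is a short worked derivation of a standard fact about this loss that shows why Hessian information matters for merging. It assumes each $\mathbf{H}_t = \mathbf{X}_t^\top \mathbf{X}_t$ is invertible and uses no regularizer or weights $\alpha_t$; the symbol $\boldsymbol{\theta}_{\text{joint}}$ is introduced only for this illustration.

$$
\nabla \bar{\ell}_t(\boldsymbol{\theta})
= \mathbf{X}_t^\top (\mathbf{X}_t \boldsymbol{\theta} - \mathbf{y}_t)
= \mathbf{H}_t (\boldsymbol{\theta} - \boldsymbol{\theta}_t),
\qquad
\boldsymbol{\theta}_t = \mathbf{H}_t^{-1} \mathbf{X}_t^\top \mathbf{y}_t .
$$

Setting the gradient of the joint loss $\sum_t \bar{\ell}_t$ to zero gives

$$
\sum_t \mathbf{H}_t (\boldsymbol{\theta}_{\text{joint}} - \boldsymbol{\theta}_t) = 0
\quad\Longrightarrow\quad
\boldsymbol{\theta}_{\text{joint}}
= \Big( \sum_t \mathbf{H}_t \Big)^{-1} \sum_t \mathbf{H}_t \boldsymbol{\theta}_t .
$$

Under these assumptions, weighting each model by its Hessian recovers the jointly trained solution exactly, while plain averaging $\frac{1}{T} \sum_t \boldsymbol{\theta}_t$ matches it in general only when all $\mathbf{H}_t$ are equal.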

Figures (2)

Figure 1: The left panel illustrates our approach. We connect the error $\Delta$ of the merged model $\boldsymbol{\theta}_{\text{merged}}$ to the gradient mismatch over losses $\bar{\ell}_t$ and propose a new method that reduces the mismatch by using the Hessian $\mathbf{H}_t$ and error $\Delta_t$ of the individual models $\boldsymbol{\theta}_t$. The right panel shows an example of adding datasets to RoBERTa trained on IMDB. We clearly see that reducing mismatch reduces test error of task arithmetic ($\alpha_t = 1$). We consider 5 datasets, indicated by a number on the markers.
Figure 2: Left: We merge models trained on $8$ image classification tasks with a pretrained ViT and vary $\alpha_t$. Our method performs similarly to task arithmetic (TA) for smaller $\alpha_t$ but significantly better for higher $\alpha_t$, improving over the best $\alpha_t$ for TA. Right: We add four sentiment analysis tasks to RoBERTa trained on IMDB. Our merging function dominates TA and requires no tuning of scaling factors. We plot the average over individual dataset accuracies.

Theorems & Definitions (2)

Theorem 1
Theorem 2
