The Non-Local Model Merging Problem: Permutation Symmetries and Variance Collapse

Ekansh Sharma, Daniel M. Roy, Gintare Karolina Dziugaite

TL;DR

This paper proposes a multi-task technique that re-scales and shifts the output activations of the merged model for each task, aligning its output statistics with those of the corresponding task-specific expert models.

Abstract

Model merging aims to efficiently combine the weights of multiple expert models, each trained on a specific task, into a single multi-task model, with strong performance across all tasks. When applied to all but the last layer of weights, existing methods -- such as Task Arithmetic, TIES-merging, and TALL mask merging -- work well to combine expert models obtained by fine-tuning a common foundation model, operating within a "local" neighborhood of the foundation model. This work explores the more challenging scenario of "non-local" merging, which we find arises when an expert model changes significantly during pretraining or where the expert models do not even share a common foundation model. We observe that standard merging techniques often fail to generalize effectively in this non-local setting, even when accounting for permutation symmetries using standard techniques. We identify that this failure is, in part, due to "variance collapse", a phenomenon identified also in the setting of linear mode connectivity by Jordan et al. (2023). To address this, we propose a multi-task technique to re-scale and shift the output activations of the merged model for each task, aligning its output statistics with those of the corresponding task-specific expert models. Our experiments demonstrate that this correction significantly improves the performance of various model merging approaches in non-local settings, providing a strong baseline for future research on this problem.
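
The re-scale-and-shift idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm; it only assumes that "aligning output statistics" means matching the per-feature mean and standard deviation of the merged model's outputs to those of the expert on a batch of task data. The function names `fit_affine_correction` and `apply_affine_correction`, the `eps` parameter, and the toy data are all hypothetical.

```python
import numpy as np

def fit_affine_correction(merged_out, expert_out, eps=1e-8):
    """Fit a per-feature scale and shift (assumed form of the correction).

    merged_out, expert_out: arrays of shape (batch, features) holding the
    output activations of the merged model and of the task expert on the
    same batch of task data.
    """
    mu_m, sd_m = merged_out.mean(axis=0), merged_out.std(axis=0)
    mu_e, sd_e = expert_out.mean(axis=0), expert_out.std(axis=0)
    scale = sd_e / (sd_m + eps)      # restores the expert's spread
    shift = mu_e - scale * mu_m      # restores the expert's mean
    return scale, shift

def apply_affine_correction(merged_out, scale, shift):
    """Apply the fitted per-feature affine map to merged-model outputs."""
    return merged_out * scale + shift

# Toy data: the "merged" outputs are a shrunken, shifted copy of the expert's,
# imitating collapsed variance. Real merged outputs would not be this clean.
rng = np.random.default_rng(0)
expert_out = rng.normal(loc=1.0, scale=2.0, size=(512, 8))
merged_out = 0.1 * expert_out - 0.5

scale, shift = fit_affine_correction(merged_out, expert_out)
corrected = apply_affine_correction(merged_out, scale, shift)
```

In a multi-task setting, one such `(scale, shift)` pair would be fitted per task, so the same merged weights are paired with a small set of task-specific correction parameters.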

Paper Structure

This paper contains 34 sections, 7 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Local vs Non-Local Accuracy Landscape. For the VGG16 model architecture, we visualize the average normalized accuracy landscape in the span of the task vectors (treating the initialization as the origin) for two classification tasks: RESISC45 and Colorectal Histology. (Left) Local merging: Expert models share the same fine-tuning initialization; (Center) Non-local merging: Expert models are derived from different pretrained models; (Right) Non-local merging modulo permutation and TACT: Expert models are derived from different pretrained models and merged modulo permutation, followed by task-specific activation repair.
  • Figure 2: Local vs Non-local Worst Case Landscape. For the same architecture and tasks as in Figure 1, we visualize the worst-case accuracy landscape, i.e., for every setting of weights in the hyperplane, the color conveys the performance of the task with the lowest normalized accuracy. (Left) No permutation: Using the average of the two foundation models as initialization, $\theta_\mathrm{init}$, we look at the worst-case accuracy in the span of the two task vectors $\tau_t = \theta_t-\theta_\mathrm{init}$. Worst-case accuracy is low throughout. (Center) Modulo permutation: Using the average of two foundation models modulo permutations as initialization, we plot the worst-case accuracy along the permuted task vectors. We notice a moderate improvement in worst-case performance; (Right) Modulo permutation + TACT: Like the center plot, but the TACT correction is applied before evaluating accuracy, resulting in a significant improvement to worst-case accuracy.
  • Figure 3: Internal statistics of a merged model modulo permutations. We plot the layer-wise activation statistics after merging 4 tasks using Task Arithmetic. We use a batch from the Colorectal Histology dataset to compute the statistics. (Left) Per-layer $\ell_2$ distance between the activation vector of the merged model on task $t$ and the activation vector of the expert model for task $t$. (Right) The ratio of the activation vector variance of the merged model on task $t$ to that of the expert model on task $t$, computed for each layer. We find that the internal statistics of the merged model on task $t$ do not match the internal statistics of the expert model for task $t$.
  • Figure 4: t-SNE projection of the last layer embeddings. For a batch of data, we visualize the t-SNE projections of the embeddings encoded by models obtained with different merging methods. (Left) Colorectal Histology, (Right) CIFAR10. We observe that embeddings after TACT correction are closer to the embeddings of the expert model.
  • Figure 5: Non-local merging ablation. We plot the average normalized accuracy across tasks (y-axis) versus the number of tasks merged (x-axis) for different merging methods (VGG16). (Left) TACT: Applying TACT significantly improves performance compared to the baseline non-local merging; (Center) REPAIR: Applying REPAIR improves on the baseline non-local merging when merging few tasks, but performance deteriorates as more models are merged; (Right) Linear Probing: Linear probing the final layer yields moderate gains for the merging methods but still underperforms TACT.
  • ...and 1 more figure
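
The two diagnostics described in the Figure 3 caption (per-layer $\ell_2$ distance and per-layer variance ratio between merged and expert activations) might be computed along the following lines. This is a hedged sketch: the caption does not say how activations are flattened or over which axes the variance is taken, so the code assumes each layer's activations are a `(batch, features)` array and takes a single variance over all entries. The function name `layer_diagnostics` and the toy activations are hypothetical.

```python
import numpy as np

def layer_diagnostics(merged_acts, expert_acts):
    """Compare merged-model and expert activations layer by layer.

    merged_acts, expert_acts: lists with one (batch, features) array per
    layer, computed on the same batch of data for task t.
    Returns two lists: per-layer l2 distances and per-layer variance ratios.
    """
    l2_dist, var_ratio = [], []
    for m, e in zip(merged_acts, expert_acts):
        # Assumption: distance over the flattened batch of activations.
        l2_dist.append(float(np.linalg.norm(m - e)))
        # Assumption: one scalar variance per layer; a ratio well below 1
        # is the signature of variance collapse.
        var_ratio.append(float(m.var() / e.var()))
    return l2_dist, var_ratio

# Toy activations: merged layers are progressively shrunken copies of the
# expert's, imitating a collapse that worsens with depth.
rng = np.random.default_rng(0)
expert_acts = [rng.normal(size=(64, 16)) for _ in range(3)]
shrink = [1.0, 0.5, 0.25]
merged_acts = [s * e for s, e in zip(shrink, expert_acts)]

l2_dist, var_ratio = layer_diagnostics(merged_acts, expert_acts)
for layer, (d, r) in enumerate(zip(l2_dist, var_ratio)):
    print(f"layer {layer}: l2 distance = {d:.3f}, variance ratio = {r:.4f}")
```

With real models the activations would come from forward hooks on each layer rather than synthetic arrays; the comparison itself stays the same.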

Theorems & Definitions (2)

  • Definition 1: The local merging problem
  • Definition 2: The non-local merging problem