Merging Text Transformer Models from Different Initializations

Neha Verma, Maha Elbayad

TL;DR

The paper asks how the minima of independently trained Transformers relate in the loss landscape and whether they can be merged via permutations. It proposes a Transformer-aware, permutation-based merging method that aligns multi-headed attention, residual, and feed-forward components through correlation-based permutation matrices, so that interpolating between the aligned minima incurs lower loss. Empirically, it reports substantial reductions in loss barriers relative to vanilla averaging on masked language modeling and, to a more variable extent, on GLUE fine-tuning, with detailed analyses by component and architectural part. The findings suggest that Transformer minima are less isolated than previously thought, and point to permutation invariance as a key factor for optimization, model merging, and ensembling in large pretrained architectures.

Abstract

Recent work on permutation-based model merging has shown impressive low- or zero-barrier mode connectivity between models from completely different initializations. However, this line of work has not yet extended to the Transformer architecture, despite its dominant popularity in the language domain. Therefore, in this work, we investigate the extent to which separate Transformer minima learn similar features, and propose a model merging technique to investigate the relationship between these minima in the loss landscape. The specifics of the architecture, like its residual connections, multi-headed attention, and discrete, sequential input, require specific interventions in order to compute model permutations that remain within the same functional equivalence class. In merging these models with our method, we consistently find lower loss barriers between minima compared to model averaging, across models trained on a masked-language modeling task or fine-tuned on a language understanding benchmark. Our results show that the minima of these models are less sharp and isolated than previously understood, and provide a basis for future work on merging separately trained Transformer models.
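
The "functional equivalence class" mentioned in the abstract can be illustrated on a single feed-forward sublayer. The following is a minimal sketch, not the authors' implementation: the toy dimensions, the random weights, and the names `feed_forward`, `perm`, and `P` are all assumptions made for illustration. It shows that permuting the hidden units with a matrix P, and applying the inverse permutation to the layer that reads them, leaves the computed function unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 4, 8  # toy sizes, chosen only for illustration

# Toy feed-forward sublayer: y = W2 @ relu(W1 @ x + b1) + b2
W1 = rng.normal(size=(d_ff, d_model))
b1 = rng.normal(size=d_ff)
W2 = rng.normal(size=(d_model, d_ff))
b2 = rng.normal(size=d_model)


def feed_forward(x, W1, b1, W2, b2):
    """Two-layer feed-forward block with a ReLU non-linearity."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2


# Hypothetical permutation of the d_ff hidden units.
perm = rng.permutation(d_ff)
P = np.eye(d_ff)[perm]  # row i of P selects hidden unit perm[i]

# Apply P to the layer that writes the hidden units,
# and its inverse (P^T) to the layer that reads them.
W1_p, b1_p = P @ W1, P @ b1
W2_p = W2 @ P.T

x = rng.normal(size=d_model)
y_original = feed_forward(x, W1, b1, W2, b2)
y_permuted = feed_forward(x, W1_p, b1_p, W2_p, b2)

# The weights differ, but the function they compute is the same.
same_function = np.allclose(y_original, y_permuted)
print(same_function)
```

Because ReLU acts elementwise, it commutes with the permutation, which is why the two parameter settings compute the same output. Residual connections and multi-headed attention constrain which permutations are allowed, which is the reason the abstract calls for architecture-specific interventions.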

Paper Structure

This paper contains 22 sections, 11 equations, 7 figures, 4 tables, and 1 algorithm.

Figures (7)

  • Figure 1: Example of a Transformer layer with its parameters outlined in blue boxes, specific hidden states as circles, and the flow of operations indicated with arrows. The circle containing a vertical bar indicates the dot-product operation, and $\bigoplus$ indicates addition. LN refers to LayerNorm modules. We include permutation and inverse permutation matrices at each weight matrix to indicate the proposed operations from our method. ${\bm{P}}_{\text{res}}$ refers to the residual permutation, ${\bm{P}}_{\text{MHA}}$ refers to the multi-headed attention (MHA) permutation, and ${\bm{P}}_{\text{FF}}$ aligns the feed-forward layers.
  • Figure 2: Example permutation matrices resulting from different strategies for attention head alignment. Each ${\bm{P}}_i$ reflects permutations for features within attention heads.
  • Figure 3: Pseudo-perplexity scores of BERTs, trained on the masked language modeling task, combined using our method. Curves differ by which components they merge. Results across 10 merges are shown with standard error regions shaded around each curve. Each additional merged component leads to further barrier reduction.
  • Figure 4: Average feature correlations between layers from different MultiBERTs. We report correlations for both Feed-Forward and MHA features. Both components see much higher average correlation after applying their respective component permutations. Values are averaged over 10 merges, with standard error regions shaded.
  • Figure 5: Visualization of correlation matrices between features before and after permuting. These features are from the seventh multi-headed attention layer of two different MultiBERTs models. On the left, 12 attention head boundaries are clearly visible, and highly correlated regions do not necessarily correspond to the same attention head indices, supporting our two-stage permutation method. On the right, the outcome of the two-stage permutation method can be seen in the dark diagonal line and its surrounding block-diagonal pattern.
  • ...and 2 more figures
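
The captions for Figures 2 and 5 refer to a two-stage permutation for attention heads. The sketch below shows one way such an alignment could be computed from a feature correlation matrix; it is an assumption-laden illustration, not the paper's code. The function names `cross_correlation` and `two_stage_head_permutation`, the toy sizes, and the synthetic features are hypothetical. The assignment solver is `scipy.optimize.linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def cross_correlation(X, Y):
    """Pearson correlation between columns of X and columns of Y."""
    Xz = (X - X.mean(0)) / X.std(0)
    Yz = (Y - Y.mean(0)) / Y.std(0)
    return Xz.T @ Yz / X.shape[0]


def two_stage_head_permutation(corr, n_heads):
    """Return perm such that feature i of model A matches feature perm[i] of model B.

    corr[i, j] is the correlation between feature i of A and feature j of B.
    Features are assumed to be laid out head by head, with equal head sizes.
    """
    d = corr.shape[0]
    d_head = d // n_heads
    head_score = np.zeros((n_heads, n_heads))
    inner = {}
    # Stage 1: for every (head of A, head of B) pair, find the best
    # one-to-one matching of the features inside the two heads.
    for a in range(n_heads):
        for b in range(n_heads):
            block = corr[a * d_head:(a + 1) * d_head, b * d_head:(b + 1) * d_head]
            rows, cols = linear_sum_assignment(block, maximize=True)
            inner[a, b] = cols
            head_score[a, b] = block[rows, cols].sum()
    # Stage 2: match whole heads using the scores from stage 1.
    heads_a, heads_b = linear_sum_assignment(head_score, maximize=True)
    perm = np.empty(d, dtype=int)
    for a, b in zip(heads_a, heads_b):
        perm[a * d_head:(a + 1) * d_head] = b * d_head + inner[a, b]
    return perm


# Synthetic check: model B's features are model A's features, shuffled in a
# way that keeps heads intact, plus a little noise.
rng = np.random.default_rng(0)
n_samples, n_heads, d_head = 200, 3, 4
feats_a = rng.normal(size=(n_samples, n_heads * d_head))
head_order = rng.permutation(n_heads)
src = np.concatenate([h * d_head + rng.permutation(d_head) for h in head_order])
feats_b = feats_a[:, src] + 0.01 * rng.normal(size=feats_a.shape)

perm = two_stage_head_permutation(cross_correlation(feats_a, feats_b), n_heads)
recovered = np.array_equal(perm, np.argsort(src))
print(recovered)
```

Restricting the search to permutations that move whole heads and then reorder features within each head keeps the result compatible with the per-head softmax, whereas an unconstrained permutation over all features could mix features across heads.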