Transformer Fusion with Optimal Transport

Moritz Imfeld, Jacopo Graldi, Marco Giordano, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh

TL;DR

The paper tackles the problem of merging independently trained Transformer models. It builds on OTFusion, a framework that uses Optimal Transport to (soft-)align the weights or activations of different models, and extends it to the components of a Transformer (residual connections, multi-head attention, embeddings, layer normalization), supporting both homogeneous fusion and heterogeneous fusion of models with different widths. A Transportation Map Flow Graph and Transformer-specific fusion strategies, including cross-head alignment and soft alignment via Sinkhorn regularization, are developed to fuse blocks coherently. Experiments on Vision Transformer and BERT show that soft, activation-based OT fusion yields strong one-shot performance and, after finetuning, often surpasses both parent models and vanilla fusion. Heterogeneous fusion additionally offers a way to compress Transformers and thereby reduce inference cost. The work provides a generalizable approach to recombining Transformer capabilities across vision and language tasks, with practical implications for model compression, transfer, and ensemble-free deployment.

Abstract

Fusion is a technique for merging multiple independently trained neural networks in order to combine their capabilities. Past attempts have been restricted to fully-connected, convolutional, and residual networks. This paper presents a systematic approach for fusing two or more transformer-based networks, exploiting Optimal Transport to (soft-)align the various architectural components. We flesh out an abstraction for layer alignment that can, in principle, generalize to arbitrary architectures. We apply it to the key ingredients of Transformers, such as multi-head self-attention, layer normalization, and residual connections, and discuss how to handle them via various ablation studies. Furthermore, our method allows the fusion of models of different sizes (heterogeneous fusion), providing a new and efficient way to compress Transformers. The proposed approach is evaluated on both image classification tasks via Vision Transformer and natural language modeling tasks using BERT. Our approach consistently outperforms vanilla fusion and, after a surprisingly short finetuning, also outperforms the individual converged parent models. In our analysis, we uncover intriguing insights about the significant role of soft alignment in the case of Transformers. Our results showcase the potential of fusing multiple Transformers, thus compounding their expertise, in the budding paradigm of model fusion and recombination. Code is available at https://github.com/graldij/transformer-fusion.
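
To make the idea of soft alignment concrete, here is a minimal NumPy sketch of fusing a single linear layer. It is an illustration under stated assumptions, not the authors' implementation: the function names `sinkhorn_plan` and `fuse_layer`, the squared Euclidean cost between weight rows, the uniform marginals, and the value of `reg` are all hypothetical choices. It also assumes the layer's input dimension is already aligned, so it ignores how transport maps propagate through residual connections, attention heads, and layer normalization.

```python
import numpy as np


def sinkhorn_plan(cost, reg=0.05, n_iters=200):
    """Entropy-regularized OT plan between uniform marginals (plain Sinkhorn)."""
    n_src, n_dst = cost.shape
    a = np.full(n_src, 1.0 / n_src)        # uniform mass over source neurons
    b = np.full(n_dst, 1.0 / n_dst)        # uniform mass over anchor neurons
    cost = cost / max(cost.max(), 1e-12)   # scale costs to [0, 1] (assumption)
    K = np.exp(-cost / reg)                # Gibbs kernel; smaller reg -> harder alignment
    u = np.ones(n_src)
    for _ in range(n_iters):
        v = b / (K.T @ u)                  # match the anchor-side marginal
        u = a / (K @ v)                    # match the source-side marginal
    return u[:, None] * K * v[None, :]


def fuse_layer(W_anchor, W_other, reg=0.05):
    """Softly align the rows (neurons) of W_other to W_anchor, then average."""
    diff = W_other[:, None, :] - W_anchor[None, :, :]
    cost = (diff ** 2).sum(axis=-1)            # pairwise squared distances
    T = sinkhorn_plan(cost, reg)               # shape (n_other, n_anchor)
    n_anchor = W_anchor.shape[0]
    # Each aligned row is an (approximately) convex mix of W_other's neurons.
    W_aligned = n_anchor * (T.T @ W_other)
    return 0.5 * (W_anchor + W_aligned)


# Toy data: the second model is the anchor with its neurons permuted.
W_a = np.eye(4, 6)
W_b = W_a[[2, 0, 3, 1]]
fused_same = fuse_layer(W_a, W_b)

# Heterogeneous case: an 8-neuron layer fused into the 4-neuron anchor.
W_wide = np.vstack([W_b, W_b])
fused_hetero = fuse_layer(W_a, W_wide)
```

In this toy setup the other model differs from the anchor only by a permutation of its neurons, so the fused weights recover the anchor up to a small smoothing error from the regularizer. Vanilla fusion, which averages the unaligned matrices directly, would mix unrelated neurons. Increasing `reg` makes the transport plan softer, so each aligned neuron blends several source neurons.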

Paper Structure (45 sections, 7 equations, 12 figures, 17 tables)

This paper contains 45 sections, 7 equations, 12 figures, 17 tables.

Figures (12)

  • Figure 1: TM flow graph for a residual connection.
  • Figure 2: Self-Attention flow graph.
  • Figure 3: ViT embeddings flow graph.
  • Figure 4: 2D slice of the accuracy landscapes of the anchor and one-shot OT and VF fused models.
  • Figure 5: (a) Sinkhorn regularizer effect on one-shot performance; (b) stability with different seeds for activations-based fusion over a different number of samples; (c) performance with different activations-filtering strategies for a different number of samples; (d) different transport map policies for residual connections over a different number of samples.
  • ...and 7 more figures