Update Your Transformer to the Latest Release: Re-Basin of Task Vectors

Filippo Rinaldi, Giacomo Capitani, Lorenzo Bonicelli, Donato Crisostomi, Federico Bolelli, Elisa Ficarra, Emanuele Rodolà, Simone Calderara, Angelo Porrello

TL;DR

The paper tackles the problem of updating a Transformer backbone while preserving downstream fine-tuning, without access to training data. It introduces TransFusion, a two-level, permutation-based re-basin method that aligns the two base models and transports the task vector in a data-free procedure. A permutation-invariant, spectral-distance-based attention alignment handles multi-head attention and residual connections, and the transport equation $\tilde{\theta}_B^{ft} = \theta_B + \alpha \pi(\tau)$ applies the fine-tuning to the new checkpoint. Key contributions include a data-free alignment framework, a spectral metric for inter-head matching, guarantees of functional equivalence for the composed permutations, and reported improvements on vision and NLP tasks with no additional data. The approach aims to keep fine-tuned models up to date efficiently, reducing retraining costs and privacy concerns while preserving performance and generalization on unseen data.

Abstract

Foundation models serve as the backbone for numerous specialized models developed through fine-tuning. However, when the underlying pretrained model is updated or retrained (e.g., on larger and more curated datasets), the fine-tuned model becomes obsolete, losing its utility and requiring retraining. This raises the question: is it possible to transfer fine-tuning to a new release of the model? In this work, we investigate how to transfer fine-tuning to a new checkpoint without having to retrain, in a data-free manner. To do so, we draw principles from model re-basin and provide a recipe based on weight permutations to re-base the modifications made to the original base model, often called the task vector. In particular, our approach tailors model re-basin for Transformer models, taking into account the challenges of residual connections and multi-head attention layers. Specifically, we propose a two-level method rooted in spectral theory, initially permuting the attention heads and subsequently adjusting parameters within select pairs of heads. Through extensive experiments on visual and textual tasks, we achieve the seamless transfer of fine-tuned knowledge to new pretrained backbones without relying on a single training step or datapoint. Code is available at https://github.com/aimagelab/TransFusion.
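
As a rough illustration of the transport rule $\tilde{\theta}_B^{ft} = \theta_B + \alpha \pi(\tau)$ quoted in the TL;DR, the following Python sketch applies it to a toy two-layer ReLU MLP rather than a Transformer. Everything here is an assumption made for illustration: the helper names (`permute_hidden`, `combine`, `transport`), the dictionary layout of the weights, and above all the permutation `perm`, which is fixed by hand so that $\theta_B$ is an exact hidden-unit permutation of $\theta_A$. In the paper, $\pi$ comes from the data-free alignment procedure, which is not reproduced here.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mlp(theta, x):
    """Toy two-layer MLP without biases: W2 @ relu(W1 @ x)."""
    hidden = [max(0.0, v) for v in matvec(theta["W1"], x)]
    return matvec(theta["W2"], hidden)


def permute_hidden(theta, perm):
    """Reorder hidden units: rows of W1 and the matching columns of W2."""
    return {
        "W1": [theta["W1"][p] for p in perm],
        "W2": [[row[p] for p in perm] for row in theta["W2"]],
    }


def combine(a, b, scale=1.0):
    """Return a + scale * b, entry by entry, for two weight dictionaries."""
    return {
        name: [[x + scale * y for x, y in zip(row_a, row_b)]
               for row_a, row_b in zip(a[name], b[name])]
        for name in a
    }


def transport(theta_B, tau, perm, alpha=1.0):
    """theta_B + alpha * pi(tau), with pi given as a hidden-unit permutation."""
    return combine(theta_B, permute_hidden(tau, perm), alpha)


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_in, d_hidden, d_out = 3, 5, 2

# Old base model, a stand-in task vector, and the fine-tuned old model.
theta_A = {"W1": rand_matrix(rng, d_hidden, d_in),
           "W2": rand_matrix(rng, d_out, d_hidden)}
tau = {"W1": rand_matrix(rng, d_hidden, d_in),
       "W2": rand_matrix(rng, d_out, d_hidden)}
theta_A_ft = combine(theta_A, tau)

# Assumption: the "new release" is theta_A with its hidden units shuffled,
# so the correct permutation is known in advance.
perm = [3, 0, 4, 1, 2]
theta_B = permute_hidden(theta_A, perm)

x = [0.5, -1.0, 2.0]
y_ref = mlp(theta_A_ft, x)                             # fine-tuned old model
y_transported = mlp(transport(theta_B, tau, perm), x)  # theta_B + pi(tau)
y_naive = mlp(combine(theta_B, tau), x)                # theta_B + tau
```

In this idealized setting `y_transported` matches `y_ref` up to floating-point rounding, whereas the naive sum $\theta_B + \tau$ generally does not. Real checkpoints are not exact permutations of each other, so the sketch only shows why the task vector should be permuted before it is added.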

Paper Structure

This paper contains 23 sections, 4 theorems, 35 equations, 6 figures, 4 tables, 2 algorithms.

Key Result

Theorem 3.1

Let $P_{\text{inter\_head}} \in S_H$ be a permutation over the $H$ attention heads, and let $P_{\text{intra\_head}} = \{ P_{\text{intra\_head}}^{(i)} \}_{i=1}^{H}$ be a set of independent permutations acting within each head (of size $d_k = \tfrac{d_m}{H}$). Then applying the composed block permutat
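
The statement above is cut off in this listing, so the following is not a rendering of Theorem 3.1 itself. It is a minimal pure-Python sketch of a standard fact in the same spirit: reordering attention heads and shuffling the $d_k$ coordinates inside each head leaves the multi-head attention output unchanged, provided the same column permutation is applied to the query, key, and value projections and the matching row permutation to the output projection. The names `mha` and `block_permutation`, the weight layout (head $h$ occupies columns $h d_k$ to $(h+1) d_k - 1$), and the omission of biases are assumptions for illustration.

```python
import math
import random


def matmul(A, B):
    """Plain list-of-rows matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def mha(X, W_Q, W_K, W_V, W_O, H):
    """Bias-free multi-head self-attention on a single sequence X (n x d_m)."""
    d_k = len(W_Q[0]) // H
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    heads = []
    for h in range(H):
        cols = slice(h * d_k, (h + 1) * d_k)
        Qh = [r[cols] for r in Q]
        Kh = [r[cols] for r in K]
        Vh = [r[cols] for r in V]
        scores = matmul(Qh, [list(c) for c in zip(*Kh)])  # Qh @ Kh^T
        attn = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
        heads.append(matmul(attn, Vh))
    # Concatenate the per-head outputs along the feature axis.
    concat = [sum((head[t] for head in heads), []) for t in range(len(X))]
    return matmul(concat, W_O)


def block_permutation(inter, intra, d_k):
    """Flat feature indices for a head permutation composed with per-head shuffles."""
    return [inter[i] * d_k + intra[i][j]
            for i in range(len(inter)) for j in range(d_k)]


def permute_cols(W, idx):
    return [[row[c] for c in idx] for row in W]


def permute_rows(W, idx):
    return [W[r] for r in idx]


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, d_m, H = 3, 4, 2
d_k = d_m // H
X = rand_matrix(rng, n, d_m)
W_Q, W_K, W_V, W_O = (rand_matrix(rng, d_m, d_m) for _ in range(4))

inter = [1, 0]            # swap the two heads
intra = [[1, 0], [0, 1]]  # shuffle inside the first new head only
idx = block_permutation(inter, intra, d_k)

out_ref = mha(X, W_Q, W_K, W_V, W_O, H)
out_perm = mha(X, permute_cols(W_Q, idx), permute_cols(W_K, idx),
               permute_cols(W_V, idx), permute_rows(W_O, idx), H)
max_diff = max(abs(a - b)
               for row_a, row_b in zip(out_ref, out_perm)
               for a, b in zip(row_a, row_b))
```

The check works because queries and keys of a head only interact through dot products, which are unaffected by a shared reordering of coordinates, and because the reordering of the concatenated head outputs is undone by the row permutation of `W_O`. Here `max_diff` stays at the level of floating-point rounding.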

Figures (6)

  • Figure 1: Transporting task vector $\tau$ from a fine-tuned base model $\theta_A^{ft} = \theta_A + \tau$ to a new release $\theta_B$.
  • Figure 2: Inter- (Step 1) and intra-head alignment (Step 2).
  • Figure 3: Residuals block and permutations.
  • Figure 4: Zero-shot gain/drop relative to $\theta_B$ of naive $\theta_B+\alpha\tau$ (blue) and our strategy $\theta_B+\alpha\pi(\tau)$ (red) varying $\alpha$.
  • Figure 5: Loss values on CIFAR-10 test set during model interpolation. Top: Our permutation approach vs. vanilla interpolation and no residual variant. Bottom: Comparison with Optimal Transport and Git Re-Basin methods, which fail to preserve functional equivalence as $\alpha \rightarrow 0$.
  • ...and 1 more figure

Theorems & Definitions (8)

  • Theorem 3.1: Equivariance of Multi-Head Attention to Structured Permutations
  • Proposition 3.2
  • Proposition 1.1
  • proof
  • proof
  • proof
  • Proposition 1.2
  • proof