Beyond the Permutation Symmetry of Transformers: The Role of Rotation for Model Fusion

Binchi Zhang, Zaiyi Zheng, Zhengzhang Chen, Jundong Li

TL;DR

The paper identifies a limitation of permutation symmetry in transformers and introduces rotation symmetry in self-attention as a continuous generalization. It develops a theoretically optimal parameter matching algorithm based on rotation (and optional rescaling) that serves as a plug-and-play module to improve model fusion, with closed-form solutions derived via Orthogonal Procrustes/Kabsch-like methods. The method aligns both FFN and ATTN components, reducing end-model distances and yielding smoother loss landscapes, as demonstrated on NLP and ViT benchmarks with favorable ablations and complexity profiles. This work suggests that exploiting parameter space symmetry can meaningfully enhance transfer, fusion, and robustness of large transformer models in diverse domains.

Abstract

Symmetry in the parameter space of deep neural networks (DNNs) has proven beneficial for various deep learning applications. A well-known example is the permutation symmetry in Multi-Layer Perceptrons (MLPs), where permuting the rows of weight matrices in one layer and applying the inverse permutation to adjacent layers yields a functionally equivalent model. While permutation symmetry fully characterizes the equivalence set for MLPs, its discrete nature limits its utility for transformers. In this paper, we introduce rotation symmetry, a novel form of parameter space symmetry for transformers that generalizes permutation symmetry by rotating parameter matrices in self-attention layers. Unlike permutation symmetry, rotation symmetry operates in a continuous domain, thereby significantly expanding the equivalence set for transformers. Based on this property, we propose a theoretically optimal parameter matching algorithm as a plug-and-play module to enhance model fusion. We evaluate our approach using pre-trained transformers across diverse natural language and vision tasks. Experimental results demonstrate that our rotation symmetry-based matching algorithm substantially improves model fusion, highlighting the potential of parameter space symmetry to facilitate model fusion. Our code is available on https://github.com/zhengzaiyi/RotationSymmetry.
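The two symmetries described in the abstract can be checked numerically. The sketch below is illustrative only and is not taken from the authors' repository: the helper names `mlp` and `attention_scores`, the row-vector convention (`q = x @ Wq + bq`), and the toy dimensions are all assumptions. It permutes the hidden units of a two-layer MLP, and separately multiplies the query and key parameters of one attention head by the same random orthogonal matrix, then compares outputs before and after.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2, b2):
    # Two-layer ReLU MLP. Shapes: x (n, d_in), W1 (d_hidden, d_in), W2 (d_out, d_hidden).
    hidden = np.maximum(x @ W1.T + b1, 0.0)
    return hidden @ W2.T + b2

def attention_scores(x, Wq, bq, Wk, bk):
    # Pre-softmax attention logits Q K^T for a single head (row-vector convention).
    q = x @ Wq + bq
    k = x @ Wk + bk
    return q @ k.T

n, d, h = 5, 8, 16  # toy sizes: tokens, model width, hidden/head width
x = rng.normal(size=(n, d))

# Permutation symmetry: reorder the rows of W1 (and b1), and reorder the
# columns of W2 the same way so the next layer reads the units in their new order.
W1, b1 = rng.normal(size=(h, d)), rng.normal(size=h)
W2, b2 = rng.normal(size=(d, h)), rng.normal(size=d)
perm = rng.permutation(h)
mlp_out = mlp(x, W1, b1, W2, b2)
mlp_out_perm = mlp(x, W1[perm], b1[perm], W2[:, perm], b2)

# Rotation symmetry: apply the same orthogonal R to query and key parameters.
# R from a QR factorization is orthogonal (its determinant may be +1 or -1).
Wq, bq = rng.normal(size=(d, h)), rng.normal(size=h)
Wk, bk = rng.normal(size=(d, h)), rng.normal(size=h)
R, _ = np.linalg.qr(rng.normal(size=(h, h)))
scores = attention_scores(x, Wq, bq, Wk, bk)
scores_rot = attention_scores(x, Wq @ R, bq @ R, Wk @ R, bk @ R)

print(np.allclose(mlp_out, mlp_out_perm))
print(np.allclose(scores, scores_rot))
```

The attention logits are unchanged because $(\bm{q}\bm{R})(\bm{k}\bm{R})^\top = \bm{q}\bm{R}\bm{R}^\top\bm{k}^\top = \bm{q}\bm{k}^\top$ for any orthogonal $\bm{R}$. A permutation matrix is one particular orthogonal matrix, which is the sense in which rotation generalizes permutation.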

Paper Structure

This paper contains 38 sections, 1 theorem, 22 equations, 6 figures, 3 tables, 1 algorithm.

Key Result

Theorem 4.1

The following optimization problem has a closed-form solution. The solution is given by where $\bm{I}$ is the identity matrix and $\bm{U}\bm{\Sigma}\bm{V}^\top=\bm{W}_{Q_1}\bm{W}_{Q_2}^\top+\bm{W}_{K_1}\bm{W}_{K_2}^\top+\bm{b}_{Q_1}^\top\bm{b}_{Q_2}+\bm{b}_{K_1}^\top\bm{b}_{K_2}$ is the singular value decomposition (SVD) of that matrix sum.
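For intuition, a closed form of this kind can be sketched with the standard Orthogonal Procrustes recipe mentioned in the TL;DR. The code below is a minimal sketch under stated assumptions, not the authors' implementation: the function name `match_rotation` is hypothetical, parameters use a row-vector convention with shapes `(d, d_k)` for weights and `(d_k,)` for biases, and the rotation is applied to the second model. The theorem's own convention may differ from this by a transpose.

```python
import numpy as np

def match_rotation(params_1, params_2):
    """Orthogonal R minimizing sum_i ||P1_i - P2_i @ R||_F^2 (Orthogonal Procrustes).

    params_1 / params_2: lists of arrays sharing the last dimension d_k,
    e.g. [W_Q, W_K, b_Q, b_K] for each of the two models (assumed layout).
    """
    # Stack all parameters row-wise so one rotation is fit to all of them jointly.
    A = np.vstack([np.atleast_2d(p) for p in params_1])
    B = np.vstack([np.atleast_2d(p) for p in params_2])
    # Cross-covariance: a sum of (model 2)^T (model 1) terms, one per parameter.
    M = B.T @ A
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

rng = np.random.default_rng(0)
d, dk = 8, 4  # toy sizes (assumed)
Wq1, Wk1 = rng.normal(size=(d, dk)), rng.normal(size=(d, dk))
bq1, bk1 = rng.normal(size=dk), rng.normal(size=dk)

# Build model 2 as model 1 in a rotated basis, so that P2 @ R_true == P1.
R_true, _ = np.linalg.qr(rng.normal(size=(dk, dk)))
Wq2, Wk2 = Wq1 @ R_true.T, Wk1 @ R_true.T
bq2, bk2 = bq1 @ R_true.T, bk1 @ R_true.T

R = match_rotation([Wq1, Wk1, bq1, bk1], [Wq2, Wk2, bq2, bk2])
print(np.allclose(R, R_true))
print(np.allclose(Wq2 @ R, Wq1))
```

In this synthetic setup the two models differ only by a hidden rotation, so the recovered `R` aligns them exactly. For independently trained models the residual will generally be nonzero, and `R` is the orthogonal matrix that makes it smallest in the Frobenius sense.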

Figures (6)

  • Figure 1: The rotation symmetry of self-attention layers.
  • Figure 2: An intuitive example of the usage of parameter space symmetry for model fusion. The background shows the contour map of the loss landscape in the model parameter space. A and B are the original end models to be merged, AB is the result of naive model fusion, and AB' is the result of model fusion with parameter matching.
  • Figure 3: Ablation Study of ViT merging over the image classification task. "ATTN" is short for "attention". We compare our matching algorithm with its three variants (w/o ATTN/FFN/rescaling) and the original performance (w/o match) on all six merging baselines.
  • Figure 4: The Euclidean Distance of end ViT models after different parameter matching algorithms.
  • Figure 5: Loss landscapes and barriers between the two pretrained ViT models under four distinct matching settings. "LB" is short for "Loss Barrier".
  • ...and 1 more figures

Theorems & Definitions (2)

  • Theorem 4.1
  • proof