Equivariant Neural Functional Networks for Transformers

Viet-Hoang Tran; Thieu N. Vo; An Nguyen The; Tho Tran Huu; Minh-Khoi Nguyen-Nhat; Thanh Tran; Duy-Tung Pham; Tan Minh Nguyen

TL;DR

This work addresses the lack of principled NFN designs for Transformer architectures by deriving the maximal symmetric group of Multi-Head Attention and formulating a weight-space group action for Transformers, then introducing Transformer-NFN, an NFN that is equivariant to this group via specialized invariant and equivariant polynomial layers. It provides a formal design framework for transformer NFNs and releases the Small Transformer Zoo, a benchmark of over $10^5$ Transformer checkpoints across vision and language tasks to enable data-driven evaluation. Empirically, Transformer-NFN outperforms baselines in both MNIST-Transformers and AGNews-Transformers settings, with ablations showing the encoder component and compact configurations are sufficient for strong generalization predictions. The dataset and methodology offer a practical pathway to analyze transformer training dynamics and generalization from weight information, promoting further exploration of symmetry-aware NFN designs in modern architectures.

Abstract

This paper systematically explores neural functional networks (NFNs) for transformer architectures. NFNs are specialized neural networks that treat the weights, gradients, or sparsity patterns of a deep neural network (DNN) as input data and have proven valuable for tasks such as learnable optimizers, implicit data representations, and weight editing. While NFNs have been extensively developed for MLPs and CNNs, no prior work has addressed their design for transformers, despite the importance of transformers in modern deep learning. This paper aims to address this gap by providing a systematic study of NFNs for transformers. We first determine the maximal symmetric group of the weights in a multi-head attention module as well as a necessary and sufficient condition under which two sets of hyperparameters of the multi-head attention module define the same function. We then define the weight space of transformer architectures and its associated group action, which leads to the design principles for NFNs in transformers. Based on these, we introduce Transformer-NFN, an NFN that is equivariant under this group action. Additionally, we release a dataset of more than 125,000 Transformer model checkpoints trained on two datasets with two different tasks, providing a benchmark for evaluating Transformer-NFN and encouraging further research on transformer training and performance.
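To make the idea of weight-space symmetry concrete, the following is a minimal NumPy sketch, not the paper's construction. It checks numerically that a multi-head attention layer computes the same function after three standard reparameterizations: multiplying the query and key projections of a head by an invertible matrix and its inverse transpose, multiplying the value and output projections by an invertible matrix and its inverse, and permuting the heads. The function names, the dimensions, the $1/\sqrt{D_k}$ scaling, and the summed-heads formulation are assumptions chosen for illustration; the exact group derived in the paper may be stated differently.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads):
    # heads: list of (W_Q, W_K, W_V, W_O); per-head outputs are summed,
    # which is equivalent to concatenating heads before one output projection.
    out = np.zeros((X.shape[0], heads[0][3].shape[1]))
    for W_Q, W_K, W_V, W_O in heads:
        scores = (X @ W_Q) @ (X @ W_K).T / np.sqrt(W_Q.shape[1])
        out += softmax(scores) @ (X @ W_V) @ W_O
    return out

def transform_head(head, M, N):
    # (W_Q M)(W_K M^{-T})^T = W_Q W_K^T and (W_V N)(N^{-1} W_O) = W_V W_O,
    # so the head's function is unchanged for any invertible M and N.
    W_Q, W_K, W_V, W_O = head
    return (W_Q @ M, W_K @ np.linalg.inv(M).T, W_V @ N, np.linalg.inv(N) @ W_O)

rng = np.random.default_rng(0)
L, D, D_k, D_v, h = 5, 8, 4, 4, 3  # hypothetical sizes for illustration
X = rng.normal(size=(L, D))
heads = [
    (rng.normal(size=(D, D_k)), rng.normal(size=(D, D_k)),
     rng.normal(size=(D, D_v)), rng.normal(size=(D_v, D)))
    for _ in range(h)
]

# Random well-conditioned invertible matrices per head, then a head permutation.
transformed = [
    transform_head(
        head,
        rng.normal(size=(D_k, D_k)) + D_k * np.eye(D_k),
        rng.normal(size=(D_v, D_v)) + D_v * np.eye(D_v),
    )
    for head in heads
]
perm = rng.permutation(h)
new_heads = [transformed[i] for i in perm]

out_original = multi_head_attention(X, heads)
out_transformed = multi_head_attention(X, new_heads)
print(np.allclose(out_original, out_transformed))
```

The two weight sets differ entry by entry yet define the same function, which is why a network that consumes transformer weights as input benefits from being invariant or equivariant to such transformations.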

Paper Structure (75 sections, 22 theorems, 189 equations, 2 figures, 9 tables)

Introduction
Background
Self-attention.
Multihead attention.
Contribution
Related Work
Maximal Symmetric Group of A Multi-head Attention
Weight Space of a Transformer Block and Group Action
Permutation Matrices
Weight space
Group action on weight space
Equivariant Polynomial NFNs for Transformers
The Small Transformer Zoo dataset
General settings.
Experimental Results
...and 60 more sections

Key Result

Theorem 3.1

Let $D$ be a positive integer. Assume that for a positive integer $h$, matrices $A_1, A_2, \ldots, A_h \in \mathbb{R}^{D \times D}$ and $B_1, B_2, \ldots, B_h \in \mathbb{R}^{D \times D}$, we have for all positive integers $L$ and $X \in \mathbb{R}^{L \times D}$. Then, if $A_1, A_2, \ldots, A_h$ are pairwise distinct, then

Figures (2)

Figure 1: Accuracy histograms for the MNIST and AGNews tasks in the Small Transformer Zoo. The number of samples is shown on a log scale for improved visibility.
Figure 2: Visualization of all models on the test set of the AGNews-Transformers dataset.

Theorems & Definitions (51)

Theorem 3.1: Independence of heads in multi-head attention
Remark 1
Theorem 3.2: Maximal symmetric group of multi-head attentions
Remark 2: Necessity of the assumptions $(a)$, $(b)$, $(c)$, and $(d)$
Remark 3: Maximal symmetric group of multi-head attentions
Definition 4.1
Remark 4: Permutation matrix vs. permutation
Definition 4.2: Group action
Theorem 4.3: Invariance of $\operatorname{Attn}$ under the action of $\mathcal{G}_{\mathcal{U}}$
Remark 5: Other types of symmetries
...and 41 more
