Permutation Equivariance of Transformers and Its Applications

Hengyuan Xu; Liyao Xiang; Hangyu Ye; Dixi Yao; Pengzhi Chu; Baochun Li

TL;DR

The work addresses how Transformer models handle permutation of inputs and parameters beyond simple inter-token shuffling by introducing permutation equivariance that covers both inter- and intra-token shuffling in forward and backward passes. It develops a formal framework with row and column permutations $P_R$ and $P_C$, proves that Transformer encoders are forward-permutation-equivariant and backward-permutation-invariant, and extends these results to general networks built from permutation-equivariant operators, with corresponding gradient mappings. Empirically, it validates these properties across ViT, BERT, and GPT2, demonstrates practical uses in privacy-preserving split learning and model authorization, and shows the approach incurs negligible computational overhead. The findings broaden the applicability of permutation properties in ordered-input tasks and offer new leverage for privacy, security, and model-protection strategies in real-world deployments.

Abstract

Transformer-based models have revolutionized deep learning and achieved remarkable performance on many tasks. Recent research has recognized that these models are robust to shuffling, but the results are limited to inter-token permutation in forward propagation. In this work, we propose a definition of permutation equivariance, a broader concept covering both inter- and intra-token permutation in the forward and backward propagation of neural networks. We rigorously prove that this permutation equivariance property is satisfied by most vanilla Transformer-based models with almost no adaptation. We examine the property over a range of state-of-the-art models, including ViT, BERT, GPT, and others, with experimental validation. Further, as a proof of concept, we explore how real-world applications, including privacy-enhancing split learning and model authorization, could exploit the permutation equivariance property, which suggests wider, intriguing application scenarios.
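To make "intra-token permutation" concrete, the following is a minimal NumPy sketch on a toy two-layer feed-forward block rather than a full Transformer. It assumes that the feature columns of the input are shuffled by $P_C$ and that each square weight matrix is replaced by $P_C^\top W P_C$. This weight mapping and the names `mlp`, `permute_weight`, and `column_equivariance_gap` are illustrative assumptions, not taken from the text above.

```python
import numpy as np

def mlp(Z, W1, W2):
    """Toy feed-forward block: ReLU(Z W1) W2, applied token-wise (row-wise)."""
    return np.maximum(Z @ W1, 0.0) @ W2

def permute_weight(W, P_C):
    """Assumed weight mapping for this sketch: W -> P_C^T W P_C."""
    return P_C.T @ W @ P_C

rng = np.random.default_rng(0)
n, d = 5, 8                        # n tokens, d features per token
Z = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))

# Column permutation matrix: Z @ P_C shuffles the features inside every token.
P_C = np.eye(d)[:, rng.permutation(d)]

# Permuted input with permuted weights ...
lhs = mlp(Z @ P_C, permute_weight(W1, P_C), permute_weight(W2, P_C))
# ... should match the original output with its columns permuted.
rhs = mlp(Z, W1, W2) @ P_C

column_equivariance_gap = float(np.abs(lhs - rhs).max())
print("max |lhs - rhs| =", column_equivariance_gap)
```

The identity holds here because $P_C P_C^\top = I$ and the ReLU acts elementwise, so it commutes with any reordering of columns. Any remaining gap comes only from floating-point summation order.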

Paper Structure (26 sections, 16 theorems, 92 equations, 9 figures, 7 tables, 1 algorithm)

Introduction
Related Works
Transformer
Permutation Equi-/In-variance
Notations
Properties and Proofs
Permutation Equivariance of Transformer
General Permutation Equivariant Networks
Other Components
Experiments
Setup
Properties Validation
Applications
Efficiency
Conclusion
...and 11 more sections

Key Result

Theorem 4.1

A Transformer encoder is permutation equivariant with respect to token permutations (row permutations of the input matrix) in forward propagation. That is, $\mathrm{Enc}({\bm{P}}_R {\bm{Z}}) = {\bm{P}}_R \mathrm{Enc} ({\bm{Z}})$ for any permutation matrix ${\bm{P}}_R \in \mathbb{R}^{n \times n}$, where $n$ is the number of tokens (rows of ${\bm{Z}}$).
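The statement of Theorem 4.1 can be checked numerically on a small example. The sketch below is not the paper's implementation. It assumes a single-head encoder layer with no positional encoding, no attention mask, and layer normalization without learnable scale or shift. The names `encoder_layer`, `make_params`, and `row_equivariance_gap` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalizes each token (row) independently; no learnable parameters.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(Z, params):
    """Single-head self-attention followed by a token-wise feed-forward block."""
    Wq, Wk, Wv, Wo, W1, W2 = params
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)   # (n, n) attention
    H = layer_norm(Z + A @ V @ Wo)                         # residual + norm
    F = np.maximum(H @ W1, 0.0) @ W2                       # ReLU feed-forward
    return layer_norm(H + F)

def make_params(rng, d, d_ff):
    shapes = [(d, d), (d, d), (d, d), (d, d), (d, d_ff), (d_ff, d)]
    return [rng.normal(scale=0.3, size=s) for s in shapes]

rng = np.random.default_rng(0)
n, d, d_ff = 6, 8, 16
params = make_params(rng, d, d_ff)
Z = rng.normal(size=(n, d))

# Row permutation matrix: P_R @ Z reorders the tokens.
P_R = np.eye(n)[rng.permutation(n)]

lhs = encoder_layer(P_R @ Z, params)    # Enc(P_R Z)
rhs = P_R @ encoder_layer(Z, params)    # P_R Enc(Z)

row_equivariance_gap = float(np.abs(lhs - rhs).max())
print("max |Enc(P_R Z) - P_R Enc(Z)| =", row_equivariance_gap)
```

The check works because the attention matrix transforms as $P_R A P_R^\top$ under a token shuffle, while every other operation acts on each row independently. Adding positional encodings or a causal mask to the sketch would break the equality as written.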

Figures (9)

Figure 1: Illustration of Transformer backbone. Learnable weights in permutation are expressed by yellow blocks.
Figure 2: Illustration of permutation properties. ${\bm{W}}$ indicates main parameters in Transformer backbone (stacked Transformer encoders and decoders).
Figure 3: Reconstruction results of model inversion attacks to features. '+' means the privacy-preserving technique is enhanced by our row permutation.
Figure 4: Training curves of fine-tuning ViT. The authorized has a performance close to normal while the unauthorized has a high loss.
Figure 5: Validation loss curves of ViT trained to convergence. The unauthorized is far worse than the authorized but better than train-from-scratch.
...and 4 more figures

Theorems & Definitions (25)

Theorem 4.1: Row Permutation Forward Equivariance
Theorem 4.2: Row Permutation Backward Invariance
Corollary 4.3
Theorem 4.4: Column Permutation Forward Equivariance
Theorem 4.5: Column Permutation Backward Equivariance
Corollary 4.6
Theorem 4.7: General Permutation Equivariant Networks
Lemma 4.8
proof
proof
...and 15 more
