Vanilla Group Equivariant Vision Transformer: Simple and Effective

Jiahong Fu, Qi Xie, Deyu Meng, Zongben Xu

TL;DR

We propose a straightforward framework that makes the key ViT components (patch embedding, self-attention, positional encodings, and Down/Up-Sampling) equivariant, yielding ViTs with guaranteed equivariance. The framework serves as a plug-and-play replacement that is both theoretically grounded and practically versatile.

Abstract

Incorporating symmetry priors as inductive biases to design equivariant Vision Transformers (ViTs) has emerged as a promising avenue for enhancing their performance. However, existing equivariant ViTs often struggle to balance performance with equivariance, primarily due to the challenge of achieving holistic equivariant modifications across the diverse modules in ViTs, particularly in harmonizing the Self-Attention mechanism with Patch Embedding. To address this, we propose a straightforward framework that systematically renders key ViT components, including patch embedding, self-attention, positional encodings, and Down/Up-Sampling, equivariant, thereby constructing ViTs with guaranteed equivariance. The resulting architecture serves as a plug-and-play replacement that is both theoretically grounded and practically versatile, scaling seamlessly even to Swin Transformers. Extensive experiments demonstrate that our equivariant ViTs consistently improve performance and data efficiency across a wide spectrum of vision tasks.
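
To make the notion of equivariance concrete, the sketch below numerically checks a standard fact: plain single-head self-attention without positional encodings commutes with a 90-degree rotation of the token grid, because the rotation only permutes tokens. Adding an arbitrary absolute positional encoding breaks this property, which is one reason positional encodings need separate treatment. This is an illustrative toy in NumPy, not the paper's EQ-Self-Attention; the helper names (`self_attention`, `attend_on_grid`), the grid size, and the random weights are all assumptions, and the rotation here acts only on token positions, not on feature channels.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, wq, wk, wv):
    # Single-head self-attention on an (N, d) token matrix.
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def attend_on_grid(grid, wq, wk, wv):
    # Flatten an (H, W, d) token grid, apply attention, restore the grid shape.
    h, w, d = grid.shape
    out = self_attention(grid.reshape(h * w, d), wq, wk, wv)
    return out.reshape(h, w, -1)

def rot(grid):
    # Rotate the spatial axes of an (H, W, d) grid by 90 degrees.
    return np.rot90(grid, k=1, axes=(0, 1))

rng = np.random.default_rng(0)
d = 8
grid = rng.normal(size=(4, 4, d))                 # hypothetical 4x4 token grid
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

# Without positional encoding: rotate-then-attend equals attend-then-rotate.
gap_no_pos = np.abs(
    attend_on_grid(rot(grid), wq, wk, wv) - rot(attend_on_grid(grid, wq, wk, wv))
).max()

# With an arbitrary (non-symmetric) absolute positional encoding the two differ.
pos = rng.normal(size=(4, 4, d))
gap_with_pos = np.abs(
    attend_on_grid(rot(grid) + pos, wq, wk, wv)
    - rot(attend_on_grid(grid + pos, wq, wk, wv))
).max()

print(f"gap without positional encoding: {gap_no_pos:.2e}")
print(f"gap with absolute positional encoding: {gap_with_pos:.2e}")
```

The first gap should be at the level of floating-point round-off, while the second is generically far from zero.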

Paper Structure

This paper contains 16 sections, 2 theorems, 20 equations, 3 figures, and 3 tables.

Key Result

Theorem 1

Let $\Phi_{eq}(\cdot)$ denote an equivariant transformer comprising $L$ layers of the equivariant Self-Attention defined in Eq.(eq:equi_sa). For an image $\mathbf{x}$ of size $H\times W\times c_0$, the following result is satisfied for all $\tilde{g} \in S$: where $\pi_{\tilde{g}}$ is a group transformation on the feature map and $\left[\cdot\right]$ denotes the composition of functions.
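
In the usual notation for equivariance, a relation of this kind reads as follows. This is a sketch under the assumption that the same $\pi_{\tilde{g}}$ acts on both the input image and the output feature map; the paper's exact statement and notation may differ.

$$
\Phi_{eq}\left[\pi_{\tilde{g}}\right](\mathbf{x}) = \pi_{\tilde{g}}\left[\Phi_{eq}\right](\mathbf{x}), \quad \forall \tilde{g} \in S.
$$

Read with $\left[\cdot\right]$ as composition, the left side transforms the image first and then applies the network, while the right side applies the network first and then transforms its output.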

Figures (3)

  • Figure 1: Overview of the proposed Equivariant Vision Transformer. (a) The overall pipeline of the equivariant vision transformer. (b)-(c) The details of the EQ-Patch Embedding and the EQ-Self-Attention, respectively.
  • Figure 2: Visualization of the input and the proposed Equivariant APE for the $C_4$ group. For EQ-APE, different numbers represent different orbits, and $\mathbf{p}_c(\cdot, \cdot)$ denotes the canonical representations.
  • Figure 3: Visual comparison of the 4× super-resolution results from various methods on img078 of Urban100.
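
The orbit numbering mentioned in the Figure 2 caption can be illustrated with a small sketch. It labels each cell of a square grid by its orbit under the four rotations of $C_4$, then ties one parameter vector to each orbit, which gives the simplest positional table that is unchanged by a 90-degree rotation. This is an assumption-laden toy, not the paper's EQ-APE: the function name `c4_orbit_labels`, the grid size, and the per-orbit parameter tying are hypothetical, and the paper's canonical representations $\mathbf{p}_c(\cdot, \cdot)$ may transform in a richer way than the invariant table built here.

```python
import numpy as np

def c4_orbit_labels(n):
    """Label each cell of an n x n grid by its orbit under 90-degree rotations."""
    labels = -np.ones((n, n), dtype=int)
    next_label = 0
    for i in range(n):
        for j in range(n):
            if labels[i, j] >= 0:
                continue  # already assigned as part of an earlier orbit
            r, c = i, j
            for _ in range(4):
                labels[r, c] = next_label
                # Index map of a 90-degree rotation about the grid centre
                # (matches where np.rot90 with k=1 sends cell (r, c)).
                r, c = n - 1 - c, r
            next_label += 1
    return labels

n, d = 4, 8                                   # hypothetical grid size and width
labels = c4_orbit_labels(n)
num_orbits = labels.max() + 1

rng = np.random.default_rng(0)
orbit_table = rng.normal(size=(num_orbits, d))  # one vector per orbit
ape = orbit_table[labels]                       # (n, n, d) positional table

print(labels)
print("number of orbits:", num_orbits)
print("table unchanged by rotation:",
      np.array_equal(np.rot90(ape, k=1, axes=(0, 1)), ape))
```

For an even grid size every orbit has four cells, so an $n \times n$ grid has $n^2/4$ orbits; for an odd size the centre cell forms an orbit on its own.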

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2