Octic Vision Transformers: Quicker ViTs Through Equivariance

David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman

TL;DR

This work tackles the inefficiency of exploiting geometric symmetries in Vision Transformers by introducing octic ViTs that realize $D_8$-equivariant layers. The approach uses Fourier-domain block-diagonal intertwiners and steerable features to achieve substantial compute and memory savings (e.g., ~5.33x fewer FLOPs per linear layer) while preserving accuracy on ImageNet-1K in both supervised and self-supervised settings. Through extensive experiments (DeiT-III and DINOv2) and targeted ablations, the authors show that fully or partially octic architectures can match or exceed baseline performance with up to ~40% FLOP reductions, and that invariantization and the number of octic blocks materially influence outcomes. These results suggest that large-scale equivariant design is practical and beneficial for ViTs, offering a pathway to faster, more memory-efficient vision models without sacrificing performance.

Abstract

Why are state-of-the-art Vision Transformers (ViTs) not designed to exploit natural geometric symmetries such as 90-degree rotations and reflections? In this paper, we argue that there is no fundamental reason, and what has been missing is an efficient implementation. To this end, we introduce Octic Vision Transformers (octic ViTs) which rely on octic group equivariance to capture these symmetries. In contrast to prior equivariant models that increase computational cost, our octic linear layers achieve 5.33x reductions in FLOPs and up to 8x reductions in memory compared to ordinary linear layers. In full octic ViT blocks the computational reductions approach the reductions in the linear layers with increased embedding dimension. We study two new families of ViTs, built from octic blocks, that are either fully octic equivariant or break equivariance in the last part of the network. Training octic ViTs supervised (DeiT-III) and unsupervised (DINOv2) on ImageNet-1K, we find that they match baseline accuracy while at the same time providing substantial efficiency gains.

Paper Structure

This paper contains 48 sections, 17 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Computational savings. Using octic layers in ViTs significantly reduces the computational complexity without sacrificing accuracy on ImageNet-1K, for both supervised and self-supervised training. Detailed results can be found in Section \ref{sec:experiments}.
  • Figure 2: $\mathrm{D}_8$ linear layers. Implementing equivariant linear layers in the Fourier domain of $\mathrm{D}_8$ gives a major computational benefit. Left: A $C\times C$ weight matrix being multiplied by $L$ tokens of feature dimension $C$. Center: The block-diagonalization that happens when enforcing the layer to be $\mathrm{D}_8$-equivariant in the Fourier domain. More precisely, we enforce equivariance with respect to the representation $\frac{C}{8}\rho_\text{iso}$ that splits into irreps $\rho_\text{A1}, \rho_\text{A2}, \rho_\text{B1}, \rho_\text{B2}$ and $\rho_\text{E}$ as detailed in Section \ref{sec:method}. There is no mixing between different irreps, and the weight sharing in the block-diagonal stems from the fact that $\rho_\text{E}$ is a two-dimensional irrep. Right: An efficient implementation of the original $C\times C$ by $C \times L$ matrix multiplication as four $\frac{C}{8}\times\frac{C}{8}$ by $\frac{C}{8}\times L$ matrix multiplications and one $\frac{C}{4}\times\frac{C}{4}$ by $\frac{C}{4}\times 2L$ matrix multiplication. An equivariant linear layer of this type requires $16/3\approx 5.33$ times fewer FLOPs to compute and has $8$ times fewer parameters than the ordinary linear layer shown to the left.
  • Figure 3: Architecture. Patches are first extracted from an image using specialized octic filters and the resulting features are processed by $k$ octic ViT blocks. The final embeddings can be fed to $l-k$ standard Transformer blocks (as demonstrated by our $\mathcal{H}_8$ and $\mathcal{I}_8$ ViTs). When $k=l$, we denote $\mathcal{I}_8(\text{ViT})$ by $\mathrm{D}_8(\text{ViT})$, which hence only uses octic ViT blocks before a final invariantization.
  • Figure 4: (a) Reduction in FLOPs from a non-equivariant Transformer block to an octic-equivariant block vs. embedding dimension. The matmul ratio reflects only matrix multiplications in linear layers and Attention; the total ratio includes all computations. (b) The effect of changing the number of octic blocks ($k$) for ViT-L, out of $l=24$ total blocks.
  • Figure 5: (a) PatchEmbed filters from a trained network. More filters are shown in Figure \ref{fig:filters-comparison}. (b-c) Cayley diagrams showing the action of $\mathrm{D}_8$ on (b) patchified images and (c) $\rho_\text{iso}$-features. Blue arrows mean horizontal mirroring, $s$, while orange arrows mean mirroring in the bottom-left to top-right diagonal, $sr$. The features were obtained by applying the filters in (a) to the patches in (b).
  • ...and 2 more figures
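
The FLOP and parameter ratios quoted in the Figure 2 caption can be checked with a few lines of NumPy. The following is a minimal sketch, not the authors' implementation: the function names `octic_linear` and `savings`, the sizes `C = 768` and `L = 197`, and the way the two-dimensional irrep features are stored as a single `(C/4, 2L)` array are all assumptions made for illustration. Cost is counted in multiply-accumulates, and only the shapes given in the caption are used.

```python
import numpy as np

def octic_linear(w_1d, w_2d, x_1d, x_2d):
    """Apply a block-diagonal linear layer as five separate matmuls.

    w_1d: four (C/8, C/8) weight blocks, one per one-dimensional irrep.
    w_2d: one (C/4, C/4) weight block for the two-dimensional irrep.
    x_1d: four (C/8, L) feature arrays.
    x_2d: one (C/4, 2L) feature array (assumed layout for the 2D irrep).
    """
    y_1d = [w @ x for w, x in zip(w_1d, x_1d)]
    y_2d = w_2d @ x_2d
    return y_1d, y_2d

def savings(C, L):
    """Return (FLOP ratio, parameter ratio) of a dense layer vs. the blocks."""
    assert C % 8 == 0, "C must be divisible by 8"
    dense_macs = C * C * L
    # Four small blocks on L tokens plus one larger block on 2L columns.
    octic_macs = 4 * (C // 8) ** 2 * L + (C // 4) ** 2 * (2 * L)
    dense_params = C * C
    octic_params = 4 * (C // 8) ** 2 + (C // 4) ** 2
    return dense_macs / octic_macs, dense_params / octic_params

# Hypothetical sizes, chosen only to exercise the shapes.
C, L = 768, 197
rng = np.random.default_rng(0)
w_1d = [rng.standard_normal((C // 8, C // 8)) for _ in range(4)]
w_2d = rng.standard_normal((C // 4, C // 4))
x_1d = [rng.standard_normal((C // 8, L)) for _ in range(4)]
x_2d = rng.standard_normal((C // 4, 2 * L))

y_1d, y_2d = octic_linear(w_1d, w_2d, x_1d, x_2d)
flop_ratio, param_ratio = savings(C, L)
print(flop_ratio, param_ratio)
```

In this count the four small blocks cost $C^2L/16$ and the larger block costs $C^2L/8$, giving $3C^2L/16$ in total, which is where the factor $16/3$ comes from. The parameter count is $C^2/16 + C^2/16 = C^2/8$. Both ratios are independent of `C` and `L`.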

Theorems & Definitions (7)

  • Example 3.1: Irreducible representations
  • Example 3.2: Regular representation
  • Example 3.3: Isotypical decomposition / Fourier transform
  • Example 3.4: Images
  • Example 3.5: ViT features
  • Example 3.6: Steerable ViT features
  • Example 3.7: Patchification of images
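
As a concrete illustration of the group behind these examples, the eight elements of $\mathrm{D}_8$ can be applied to a square image patch with NumPy. This is a small sketch under stated assumptions: the helper name `d8_orbit` is hypothetical, and the conventions (counterclockwise `np.rot90` as the rotation $r$, `np.fliplr` as the horizontal mirroring $s$) are chosen here for convenience and may differ from the paper's ordering of group elements.

```python
import numpy as np

def d8_orbit(patch):
    """Return the 8 copies of a square patch under 90-degree rotations
    and mirroring (4 rotations, then 4 rotations of the mirrored patch)."""
    rotations = [np.rot90(patch, k) for k in range(4)]
    mirrored = [np.rot90(np.fliplr(patch), k) for k in range(4)]
    return rotations + mirrored

def r(x):
    """Rotate by 90 degrees counterclockwise (assumed convention)."""
    return np.rot90(x)

def s(x):
    """Mirror horizontally, i.e. flip left to right."""
    return np.fliplr(x)

# A patch with no symmetry of its own, so all 8 images are distinct.
patch = np.arange(16).reshape(4, 4)
orbit = d8_orbit(patch)
num_distinct = len({o.tobytes() for o in orbit})

# Defining relations of the group: r^4 = e, s^2 = e, s r s = r^{-1}.
r4_is_identity = np.array_equal(r(r(r(r(patch)))), patch)
s2_is_identity = np.array_equal(s(s(patch)), patch)
srs_is_r_inverse = np.array_equal(s(r(s(patch))), np.rot90(patch, -1))
```

For a patch that is itself symmetric (for example, a constant patch), some of the eight images coincide, so `num_distinct` would be smaller than 8.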