Octic Vision Transformers: Quicker ViTs Through Equivariance

David Nordström; Johan Edstedt; Fredrik Kahl; Georg Bökman

TL;DR

This work addresses the computational cost of exploiting geometric symmetries in Vision Transformers by introducing octic ViTs, whose layers are $D_8$-equivariant. The approach uses Fourier-domain block-diagonal intertwiners and steerable features to save compute and memory (e.g., roughly 5.33x fewer FLOPs per linear layer) while preserving accuracy on ImageNet-1K in both supervised and self-supervised settings. In experiments with DeiT-III and DINOv2, supported by targeted ablations, the authors show that fully or partially octic architectures can match or exceed baseline performance with FLOP reductions of up to about 40%, and that both invariantization and the number of octic blocks materially affect the results. These results suggest that equivariant design is practical at scale for ViTs, offering a path to faster, more memory-efficient vision models without sacrificing performance.

Abstract

Why are state-of-the-art Vision Transformers (ViTs) not designed to exploit natural geometric symmetries such as 90-degree rotations and reflections? In this paper, we argue that there is no fundamental reason, and what has been missing is an efficient implementation. To this end, we introduce Octic Vision Transformers (octic ViTs) which rely on octic group equivariance to capture these symmetries. In contrast to prior equivariant models that increase computational cost, our octic linear layers achieve 5.33x reductions in FLOPs and up to 8x reductions in memory compared to ordinary linear layers. In full octic ViT blocks the computational reductions approach the reductions in the linear layers with increased embedding dimension. We study two new families of ViTs, built from octic blocks, that are either fully octic equivariant or break equivariance in the last part of the network. Training octic ViTs supervised (DeiT-III) and unsupervised (DINOv2) on ImageNet-1K, we find that they match baseline accuracy while at the same time providing substantial efficiency gains.
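The 5.33x and 8x figures quoted above can be reproduced with a short counting argument. The sketch below is not taken from the paper's code. It assumes that the $d$ feature channels are split as in the regular representation of the octic group: four one-dimensional irreducible representations with $d/8$ channels each, and one two-dimensional irreducible representation occupying the remaining $d/2$ channels as $d/4$ copies. Under that assumption, Schur's lemma forces an equivariant linear map to be block-diagonal, and the block for the two-dimensional part reuses a single $(d/4) \times (d/4)$ weight matrix for both components. The function name `octic_linear_cost` is hypothetical.

```python
from fractions import Fraction


def octic_linear_cost(d: int) -> dict:
    """Compare a dense d x d linear map with a block-diagonal octic one.

    Counts multiply-accumulates (MACs) per token and weight entries.
    Assumption (not from the source text): channels are split as in the
    regular representation of the octic group.
    """
    if d % 8 != 0:
        raise ValueError("d must be divisible by 8")

    dense_macs = d * d
    dense_params = d * d

    # Four 1-D irrep types, each with d/8 channels and its own
    # (d/8 x d/8) weight matrix.
    one_dim_params = 4 * (d // 8) ** 2
    one_dim_macs = one_dim_params

    # One 2-D irrep with d/4 copies (d/2 channels in total). A single
    # (d/4 x d/4) weight matrix is applied to both components, so the
    # weights are stored once but used twice.
    two_dim_params = (d // 4) ** 2
    two_dim_macs = 2 * two_dim_params

    octic_macs = one_dim_macs + two_dim_macs
    octic_params = one_dim_params + two_dim_params

    return {
        "flop_reduction": Fraction(dense_macs, octic_macs),
        "param_reduction": Fraction(dense_params, octic_params),
    }


if __name__ == "__main__":
    cost = octic_linear_cost(768)
    print(float(cost["flop_reduction"]), float(cost["param_reduction"]))
```

Under these assumptions the block-diagonal map costs $3d^2/16$ MACs and stores $d^2/8$ weights, giving ratios of $16/3 \approx 5.33$ and $8$ regardless of $d$. The count covers only the weights of a single linear layer; activations, attention, and normalization are not included, which is consistent with the abstract's statement that full blocks only approach the linear-layer savings.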
