Steerable Transformers for Volumetric Data

Soumyabrata Kundu, Risi Kondor

TL;DR

Steerable Transformers extend Vision Transformers to volumetric data by preserving $\mathrm{SE}(d)$-equivariance through a Fourier-space, steerable-attention mechanism built on top of a steerable convolutional encoder. The approach introduces steerable positional encodings and Fourier-domain nonlinearities to maintain symmetry while efficiently coupling local and global representations. Empirical results across RotMNIST, ModelNet10, PH2, and BraTS show consistent improvements over steerable-convolution baselines, validating the integration of equivariant transformers with structured volumetric data. The framework promises robust, geometry-aware analysis with potential impact in high-stakes domains like medical imaging, where rotational and translational invariance are critical.

Abstract

We introduce Steerable Transformers, an extension of the Vision Transformer mechanism that maintains equivariance to the special Euclidean group $\mathrm{SE}(d)$. We propose an equivariant attention mechanism that operates on features extracted by steerable convolutions. Operating in Fourier space, our network utilizes Fourier space non-linearities. Our experiments in both two and three dimensions show that adding steerable transformer layers to steerable convolutional networks enhances performance.
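
To make the idea of a Fourier-space non-linearity concrete, the following is a minimal 2D sketch and not the authors' implementation. It assumes that a frequency-$k$ feature picks up the phase $e^{-ik\theta}$ under a rotation by $\theta$, and it uses a norm-based non-linearity (ReLU applied to the magnitude, phase left untouched), a common choice in steerable networks that may differ from the one used in the paper. The names `rotate_fourier` and `norm_nonlinearity` are hypothetical.

```python
import numpy as np

def rotate_fourier(feat, freqs, theta):
    """Act on Fourier-space features with a rotation by angle theta.

    Assumed convention: the frequency-k component is multiplied by exp(-1j*k*theta).
    """
    return feat * np.exp(-1j * freqs * theta)

def norm_nonlinearity(feat, bias):
    """Apply ReLU to the magnitude (shifted by a bias) and keep the phase unchanged."""
    mag = np.abs(feat)
    new_mag = np.maximum(mag + bias, 0.0)
    # Guard against division by zero where the magnitude vanishes.
    phase = feat / np.where(mag > 0, mag, 1.0)
    return new_mag * phase

rng = np.random.default_rng(0)
freqs = np.array([0, 1, 2, 3])                      # one feature per frequency
feat = rng.normal(size=4) + 1j * rng.normal(size=4)
bias = np.array([-0.1, -0.2, 0.1, 0.0])             # illustrative values only
theta = 0.7

# Equivariance check: rotating then activating equals activating then rotating.
lhs = norm_nonlinearity(rotate_fourier(feat, freqs, theta), bias)
rhs = rotate_fourier(norm_nonlinearity(feat, bias), freqs, theta)
equivariance_error = float(np.max(np.abs(lhs - rhs)))
```

Because the rotation only changes the phase and the non-linearity only changes the magnitude, the two operations commute and `equivariance_error` stays at floating-point round-off level.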

Paper Structure

This paper contains 44 sections, 2 theorems, 55 equations, 6 figures, 3 tables.

Key Result

Theorem 1

For any $f^{\textrm{in}}$ that transforms under the action of $\mathrm{SE}(d)$ according to the equivariance constraint, the output of the self-attention mechanism also satisfies the equivariance constraint for any collection of weight matrices if and only if, for every […], for all $\textbf{x},\textbf{y}\in \mathbb{R}^d$ and for all irreps $\rho$.

Figures (6)

  • Figure 1: Schematic of the steerable self-attention mechanism for a single head $(h=1)$ and one query dimension $(d_K = 1)$ (left) and of a steerable transformer encoder layer (right); cf. Figure 1 of Dosovitskiy et al. (2020).
  • Figure 2: Visual representation of steerable positional encoding. Arrows denote directional components, while the color gradient indicates magnitude, decaying proportionally to $r^{-2}$.
  • Figure 3: Equivariance of attention scores in a trained model. For a fixed pixel $\textbf{x}_i$, we plot the maximum attention score for that pixel, $\max_j \alpha(\textbf{x}_i, \textbf{x}_j)$. The subfigures correspond to individual heads. The first and last heads appear to capture the object's boundary, while the other two heads focus on the object's body.
  • Figure 4: Ground truth and predicted segmentation, along with the predicted probability for the true class, for two examples from each dataset. In the PH2 example, red represents the binary mask. In the BraTS examples, red, blue, and green represent enhancing tumor, tumor core, and whole tumor, respectively.
  • Figure 5: Examples from the ModelNet10 dataset are shown in various formats: point cloud, voxel representation, and rotated perturbations of voxels.
  • ...and 1 more figure
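
The steerable positional encoding described in the Figure 2 caption can be illustrated with a small 2D sketch. This is an assumption-laden toy and not the paper's definition: it takes the encoding of a relative position with polar coordinates $(r, \phi)$ to be $r^{-2} e^{-ik\phi}$ for each frequency $k$, which matches the $r^{-2}$ decay in the caption, while the angular part and the sign convention are assumed. The helper names `positional_encoding_2d` and `rotation_matrix`, and the `eps` regularizer, are hypothetical.

```python
import numpy as np

def positional_encoding_2d(x, freqs, eps=1e-8):
    """Hypothetical 2D steerable encoding: r^{-2} * exp(-1j*k*phi) for each frequency k.

    x: array of shape (2,), a relative position x_i - x_j.
    eps: small constant (an assumption) that avoids division by zero at r = 0.
    """
    r = np.hypot(x[0], x[1])
    phi = np.arctan2(x[1], x[0])
    radial = 1.0 / (r**2 + eps)          # magnitude decays like r^{-2}
    return radial * np.exp(-1j * freqs * phi)

def rotation_matrix(theta):
    """Counter-clockwise 2D rotation by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

freqs = np.arange(0, 4)
x = np.array([1.5, -0.5])
theta = 1.1

# Steerability check: encoding a rotated position equals applying a
# per-frequency phase to the encoding of the original position.
rotated = positional_encoding_2d(rotation_matrix(theta) @ x, freqs)
steered = np.exp(-1j * freqs * theta) * positional_encoding_2d(x, freqs)
steering_error = float(np.max(np.abs(rotated - steered)))
```

Rotating the input only shifts the polar angle, so every frequency component changes by a known phase while the $r^{-2}$ magnitude is untouched; this is the sense in which such an encoding is steerable.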

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Lemma 1
  • proof