Steerable Transformers for Volumetric Data
Soumyabrata Kundu, Risi Kondor
TL;DR
Steerable Transformers extend Vision Transformers to volumetric data while preserving $\mathrm{SE}(d)$-equivariance, using a steerable attention mechanism that operates in Fourier space on top of a steerable convolutional encoder. The approach introduces steerable positional encodings and Fourier-space nonlinearities, which maintain the symmetry while efficiently coupling local and global representations. Experiments on RotMNIST, ModelNet10, PH2, and BraTS show consistent improvements over steerable-convolution baselines, supporting the combination of equivariant transformer layers with steerable convolutions. The framework targets geometry-aware analysis in high-stakes domains such as medical imaging, where consistent behavior under rotations and translations is critical.
Abstract
We introduce Steerable Transformers, an extension of the Vision Transformer mechanism that maintains equivariance to the special Euclidean group $\mathrm{SE}(d)$. We propose an equivariant attention mechanism that operates on features extracted by steerable convolutions. Because the network operates in Fourier space, its nonlinearities are applied in Fourier space as well. Our experiments in both two and three dimensions show that adding steerable transformer layers to steerable convolutional networks enhances performance.
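To make the idea of equivariant attention concrete, the following is a minimal sketch for the planar rotation group only, not the authors' implementation. It assumes each token carries complex Fourier-space features whose frequency-$m$ channel is multiplied by $e^{-im\theta}$ under a rotation by $\theta$ (the sign convention, the frequency list `FREQS`, and all function names are hypothetical). It ignores translations, positional encodings, and the permutation of grid positions that a rotation also induces. Under these assumptions, attention weights built from Hermitian inner products are rotation-invariant, so the attended output transforms like the values.

```python
import cmath
import math
import random

FREQS = [0, 1, 2, 3]  # hypothetical angular frequencies, one per feature channel


def rotate(tokens, theta):
    """Rotate features: the frequency-m channel picks up exp(-i*m*theta)."""
    return [[z * cmath.exp(-1j * m * theta) for z, m in zip(tok, FREQS)]
            for tok in tokens]


def attention_weights(queries, keys):
    """Softmax over keys of the real part of the Hermitian product <q, k>."""
    scale = math.sqrt(len(FREQS))
    weights = []
    for q in queries:
        scores = [sum(a * b.conjugate() for a, b in zip(q, k)).real / scale
                  for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def attend(queries, keys, values):
    """Mix value tokens with real weights; frequency channels are never mixed."""
    w = attention_weights(queries, keys)
    return [[sum(w_ij * v[c] for w_ij, v in zip(row, values))
             for c in range(len(FREQS))] for row in w]


def random_tokens(rng, n):
    """Draw n tokens with complex Gaussian entries, one entry per frequency."""
    return [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in FREQS]
            for _ in range(n)]


rng = random.Random(0)
q, k, v = (random_tokens(rng, 5) for _ in range(3))
theta = 0.7

out = attend(q, k, v)
out_rot = attend(rotate(q, theta), rotate(k, theta), rotate(v, theta))
expected = rotate(out, theta)

# Equivariance check: attending to rotated inputs equals rotating the output.
max_err = max(abs(a - b)
              for row_a, row_b in zip(out_rot, expected)
              for a, b in zip(row_a, row_b))
assert max_err < 1e-9
```

The phases cancel inside each inner product because query and key channels of the same frequency rotate by the same factor. That cancellation is the reason the weights stay fixed and the sketch can leave the value channels untouched.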
