
Cross-Axis Transformer with 3D Rotary Positional Embeddings

Lily Erickson

TL;DR

Cross-Axis Transformer (CAT) introduces a linear-time cross-axis attention mechanism for vision inputs, using 3D Rotary Embeddings to encode 2D spatial information with scale-invariant cross-axis encoding, achieving $O(N)$ complexity instead of $O(N^2)$. It fuses Axial Attention with Retentive Network ideas to remove Softmax and enable efficient cross-patch retention. Key contributions include Multi-Scale Rotary Axial Embeddings, Residual Imprint, and extensive ablations on ImageNet-1k, showing substantial improvements in accuracy and training efficiency with fewer FLOPs. The authors emphasize accessibility on consumer hardware, plan open-source release, and advocate broader-scale testing to validate scalability.
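The link between removing Softmax and the drop from $O(N^2)$ to $O(N)$ can be illustrated with a small sketch. This is not CAT's actual mechanism: it omits the cross-axis structure, the rotary embeddings, and any retention decay, and every name and size below is a hypothetical chosen for illustration. It shows only the general fact that, without a Softmax between the two products, $(QK^\top)V$ can be regrouped as $Q(K^\top V)$, which never materializes an $N \times N$ matrix.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def quadratic_attention(q, k, v):
    # (Q K^T) V: builds an N x N score matrix first, so cost grows as O(N^2 * d).
    return matmul(matmul(q, transpose(k)), v)

def linear_attention(q, k, v):
    # Q (K^T V): builds a d x d summary first, so cost grows as O(N * d^2).
    # This regrouping is only valid because no Softmax sits between the products.
    return matmul(q, matmul(transpose(k), v))

def random_matrix(rows, cols, rng):
    """Fill a rows x cols matrix with uniform values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Hypothetical sizes: 16 tokens (patches) with 4 features each.
rng = random.Random(0)
n_tokens, dim = 16, 4
q = random_matrix(n_tokens, dim, rng)
k = random_matrix(n_tokens, dim, rng)
v = random_matrix(n_tokens, dim, rng)

out_quadratic = quadratic_attention(q, k, v)
out_linear = linear_attention(q, k, v)

# Largest elementwise gap between the two groupings (floating point noise only).
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(out_quadratic, out_linear)
    for a, b in zip(row_a, row_b)
)
```

Both groupings produce the same output up to rounding, but only the second keeps the cost linear in the number of tokens when the feature dimension is fixed.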

Abstract

Despite lagging behind their modal cousins in many respects, Vision Transformers have provided an interesting opportunity to bridge the gap between sequence modeling and image modeling. Until now, however, Vision Transformers have largely been held back by both computational inefficiency and a lack of proper handling of spatial dimensions. In this paper, we introduce the Cross-Axis Transformer (CAT), a model inspired by both Axial Transformers and Microsoft's recent Retentive Network. CAT drastically reduces the number of floating point operations required to process an image, while simultaneously converging faster and more accurately than the Vision Transformers it replaces.

Paper Structure (8 sections, 8 equations, 1 figure, 4 tables, 2 algorithms)