CAT: Circular-Convolutional Attention for Sub-Quadratic Transformers

Yoshihiro Yamada

TL;DR

CAT introduces a sub-quadratic, softmax-preserving attention mechanism for Transformers by formulating a circulant attention kernel and computing it in the frequency domain with FFT/IFFT. Framed within Engineering-Isomorphic Transformers (EITs), CAT reduces the need to materialize attention maps and lowers memory use while maintaining global context, achieving $O(N\log N)$ complexity without sequence-length-dependent hyperparameters. Empirically, CAT matches or surpasses standard attention on ImageNet-1k and WikiText-103, with notable strength in masked language modeling and in scenarios with simple token mixing; a hybrid CAT-Alter variant often outperforms vanilla attention or CAT alone. Ablation studies show that merging the query and key projections (the qv variant) balances accuracy and parameter count, and that partially substituting CAT for standard attention can yield practical efficiency gains. The work suggests a practical path toward scalable Transformers and informs the future design of high-performance attention architectures.

Abstract

Transformers have driven remarkable breakthroughs in natural language processing and computer vision, yet their standard attention mechanism still imposes $O(N^2)$ complexity, hindering scalability to longer sequences. We introduce Circular-convolutional ATtention (CAT), a Fourier-based approach that efficiently applies circular convolutions to reduce complexity without sacrificing representational power. CAT achieves $O(N\log N)$ computations, requires fewer learnable parameters by streamlining fully connected layers, and introduces no additional heavy operations, resulting in consistent accuracy improvements and about a 10% speedup in naive PyTorch implementations. Based on the Engineering-Isomorphic Transformers (EITs) framework, CAT's design not only offers practical efficiency and ease of implementation, but also provides insights to guide the development of future high-performance Transformer architectures. Finally, our ablation studies highlight the key conditions underlying CAT's success, shedding light on broader principles for scalable attention mechanisms.
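
The drop from $O(N^2)$ to $O(N\log N)$ rests on the circular convolution theorem. The block below is a sketch of that identity, not the paper's own notation: the symbols $z$, $a$, $A$, and $V$, and the $(i-j) \bmod N$ indexing convention, are assumptions chosen for illustration, and the paper may roll in the opposite direction.

```latex
% Assumed notation: z \in \mathbb{R}^{N} holds one score per token (single head),
% V \in \mathbb{R}^{N \times D} holds the values.
\begin{aligned}
a &= \operatorname{softmax}(z) \in \mathbb{R}^{N}, \qquad
A_{ij} = a_{(i-j) \bmod N}, \\
(AV)_{i} &= \sum_{j=0}^{N-1} a_{(i-j) \bmod N}\, V_{j}
          = \Bigl[\operatorname{IFFT}\bigl(\operatorname{FFT}(a) \odot \operatorname{FFT}(V)\bigr)\Bigr]_{i}.
\end{aligned}
% Dense product: O(N^2 D). FFT route: O(D \, N \log N), with the transforms taken
% along the token axis and FFT(a) broadcast across the D channels.
```

Because every row of $A$ is a cyclic shift of the same softmax vector $a$, each row still sums to one, which is the sense in which the circulant form can remain softmax-preserving.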

Paper Structure

This paper contains 24 sections, 8 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: From Self-Attention to CAT with two implementations. Left: standard Self-Attention with a dense $N{\times}N$ attention map ($O(N^2)$). Middle: CAT ($O(N^2)$), a softmax-preserving circulant form of attention that reduces intermediate computations but remains quadratic overall. Right: CAT ($O(N\log N)$), the same circulant attention computed in the frequency domain using the Fast Fourier Transform (FFT), its inverse (IFFT), and an element-wise Hadamard product, achieving sub-quadratic complexity.
  • Figure 2: Ablation study comparing different parameterization strategies for query, key, and value (qkv, qv, q, v). Although fully splitting qkv (Averaged-Key) can yield slightly higher accuracy, it brings the parameter count back to that of standard attention. Our qv variant (CAT) strikes a practical balance, maintaining sub-quadratic complexity and competitive performance.
  • Figure 3: Attention-map visualizations using CLIP-B (average pooling). Top-left: input image. Bottom row (left$\to$right): Self-Attention, CAT, and CAT-Alter. Each panel shows 196$\times$196 token-to-token attention maps arranged as a 12$\times$12 grid: rows correspond to attention heads ($h{=}1\ldots12$) within each layer, and columns correspond to layers ($l{=}1\ldots12$) from top to bottom. All maps share the same colormap and are normalized by min-max scaling.
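
The two CAT implementations described in Figure 1 can be compared numerically. The following is a minimal single-head NumPy sketch, not the authors' code: the function names, the `(i - j) mod N` indexing convention, and the use of one scalar score per token are all assumptions made for illustration. It builds the dense circulant map (the $O(N^2)$ route) and the FFT/Hadamard/IFFT route (the $O(N\log N)$ route), which should agree up to floating-point error.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()


def circulant_attention_dense(z, v):
    """O(N^2) route: materialize the N x N circulant map, then multiply.

    z: (N,) per-token scores (hypothetical single-head input).
    v: (N, D) values.
    """
    n = z.shape[0]
    a = softmax(z)
    # attn[i, j] = a[(i - j) mod N]; every row is a cyclic shift of `a`,
    # so each row still sums to 1. (Assumed indexing convention.)
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    attn = a[idx]
    return attn @ v


def circulant_attention_fft(z, v):
    """O(N log N) route: FFT, element-wise (Hadamard) product, inverse FFT."""
    a = softmax(z)
    a_hat = np.fft.fft(a)            # (N,)
    v_hat = np.fft.fft(v, axis=0)    # (N, D), transform along the token axis
    out = np.fft.ifft(a_hat[:, None] * v_hat, axis=0)
    return out.real                  # inputs are real, so the imaginary part is round-off


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=196)         # e.g. 196 image tokens, as in Figure 3
    v = rng.normal(size=(196, 64))
    dense = circulant_attention_dense(z, v)
    fast = circulant_attention_fft(z, v)
    print("max abs difference:", np.abs(dense - fast).max())
```

The FFT route never forms the $N{\times}N$ map, which is where the memory saving comes from; the dense route is kept here only as a reference for checking the result.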