Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads

Ali K. Rahimian; Manish K. Govind; Subhajit Maity; Dominick Reilly; Christian Kümmerle; Srijan Das; Aritra Dutta

TL;DR

Fibottention introduces a fixed, head-diverse sparse self-attention mechanism for Vision Transformers that reduces self-attention complexity from $O(N^2)$ to $O(N \log N)$ by using Fibonacci-based dilations drawn from the Wythoff array. Each attention head uses its own sparsity pattern, chosen so that patterns do not overlap across heads, which promotes diverse and complementary feature representations and enables efficient inference with minimal loss (or even gains) in accuracy across image, video, and robotics tasks. Empirically, Fibottention achieves competitive performance while using only a few percent of the token interactions of dense MHSA, outperforms several sparse-attention baselines on multiple datasets and backbones, and is robust to corruptions. The work also provides ablations on head diversity, inductive bias, and dilation choices, highlighting the practical impact of structured, head-wise diversity for efficient representation learning in visual domains with limited data. Overall, Fibottention offers a scalable, architecture-agnostic approach to efficient visual Transformers, with potential applicability to larger-scale models and other domains requiring causal or long-range attention.
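To make the "Fibonacci-based dilations drawn from the Wythoff array" concrete, the following is a minimal sketch of how rows of the Wythoff array can be generated. It uses the standard definition of the array (row $m$ starts with $\lfloor \lfloor m\phi \rfloor \phi \rfloor$ and $\lfloor \lfloor m\phi \rfloor \phi^2 \rfloor$ and then follows the Fibonacci recurrence). The function names `floor_mul_phi` and `wythoff_row` are hypothetical, and how rows are assigned to attention heads is not specified in this summary, so that step is not shown.

```python
import math


def floor_mul_phi(n):
    """Exact floor(n * phi) for a non-negative integer n, using integer arithmetic."""
    # n * phi = (n + sqrt(5 * n^2)) / 2, and isqrt avoids floating-point rounding.
    return (n + math.isqrt(5 * n * n)) // 2


def wythoff_row(m, length):
    """First `length` entries of row m (1-indexed) of the Wythoff array."""
    k = floor_mul_phi(m)
    first = floor_mul_phi(k)      # floor(floor(m * phi) * phi)
    row = [first, first + k]      # floor(k * phi^2) = floor(k * phi) + k
    while len(row) < length:
        row.append(row[-1] + row[-2])  # Fibonacci recurrence within a row
    return row[:length]


if __name__ == "__main__":
    for m in range(1, 4):
        print(m, wythoff_row(m, 5))
    # 1 [1, 2, 3, 5, 8]
    # 2 [4, 7, 11, 18, 29]
    # 3 [6, 10, 16, 26, 42]
```

Every positive integer appears exactly once in the Wythoff array, so distinct rows give disjoint sequences of offsets, which is one way to read the "non-overlapping across heads" property described above.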

Abstract

Vision Transformers and their variants have achieved remarkable success in diverse visual perception tasks. Despite their effectiveness, they suffer from two significant limitations. First, the quadratic computational complexity of multi-head self-attention (MHSA) restricts scalability to large token counts. Second, they depend heavily on large-scale training data to attain competitive performance. In this paper, to address these challenges, we propose a novel sparse self-attention mechanism named Fibottention. Fibottention employs structured sparsity patterns derived from the Wythoff array, enabling an $\mathcal{O}(N \log N)$ computational complexity in self-attention. By design, its sparsity patterns vary across attention heads, which provably reduces redundant pairwise interactions while ensuring sufficient and diverse coverage. This leads to an \emph{inception-like functional diversity} in the attention heads, and promotes more informative and disentangled representations. We integrate Fibottention into standard Transformer architectures and conduct extensive experiments across multiple domains, including image classification, video understanding, and robot learning. Results demonstrate that models equipped with Fibottention either significantly outperform or achieve on-par performance with their dense MHSA counterparts, while leveraging only $2$ to $6\%$ of the pairwise interactions in self-attention heads (about $2\%$ in typical settings), resulting in substantial computational savings. Moreover, when compared to existing sparse attention mechanisms, Fibottention consistently achieves superior results on a FLOP-equivalency basis. Finally, we provide an in-depth analysis of the enhanced feature diversity resulting from our attention design and discuss its implications for efficient representation learning.
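The $\mathcal{O}(N \log N)$ claim can be illustrated with a small sketch of a dilated sparse mask for one head. This is an illustration under stated assumptions, not the authors' implementation: it assumes that a head keeps entry $(j, k)$ when the distance $|j - k|$ is a term of a generalized Fibonacci sequence $\text{Fib}(a, b)$ that does not exceed a window size $w$ (following the description in the Figure 1 caption), and that every token also attends to itself. The names `fibonacci_offsets` and `dilated_mask` are hypothetical.

```python
import numpy as np


def fibonacci_offsets(a, b, w):
    """Terms of the generalized Fibonacci sequence Fib(a, b) that are <= w."""
    offsets = []
    while a <= w:
        offsets.append(a)
        a, b = b, a + b
    return offsets


def dilated_mask(n_tokens, a, b, w):
    """Boolean (n_tokens x n_tokens) mask; True marks a kept query-key pair."""
    idx = np.arange(n_tokens)
    dist = np.abs(idx[:, None] - idx[None, :])
    mask = np.isin(dist, fibonacci_offsets(a, b, w))
    np.fill_diagonal(mask, True)  # assumption: each token attends to itself
    return mask


if __name__ == "__main__":
    for n in (64, 256, 1024):
        mask = dilated_mask(n, a=1, b=2, w=n)
        print(n, len(fibonacci_offsets(1, 2, n)), f"{mask.mean():.4f}")
```

Because the terms of $\text{Fib}(a, b)$ grow exponentially, only $\mathcal{O}(\log w)$ offsets fit inside the window, and each offset contributes at most $2N$ entries, so the mask has $\mathcal{O}(N \log N)$ nonzeros rather than $N^2$. The printed density therefore shrinks as `n` grows.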

Paper Structure

This paper contains 36 sections, 3 theorems, 38 equations, 8 figures, 12 tables, and 3 algorithms.

Introduction
Related Work
Method
Sparse Attention with Windowed Dilation
Diverse Sparse Attention through Wythoff–Fibonacci Dilation Sequences
Fibottention for Image Classification
Experimental Setup
Experimental Results
Fibottention in Other Visual Domains
Video Action Classification
Robot Learning
Ablation Studies and Analytical Findings
Validation of Head Diversity
Validation of Inductive Bias
Computational Complexity
...and 21 more sections

Key Result

Lemma 1

If $\text{Fib}(a,b) = (f_n)_{n \in \mathbb{N}}$ is the generalized Fibonacci sequence with initial values $f_1 = a$ and $f_2 = b$, then it holds that for each $n \geq 1$ and for each $n \geq 2$, where $\phi = (1+\sqrt{5})/2$ and $\psi= (1- \sqrt{5})/2$.
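One standard closed form consistent with the definitions in Lemma 1, offered here as an assumption rather than as the paper's exact statement, is $f_n = \frac{(b - a\psi)\,\phi^{n-1} - (b - a\phi)\,\psi^{n-1}}{\sqrt{5}}$ for each $n \geq 1$. The sketch below checks this expression numerically against the recurrence $f_n = f_{n-1} + f_{n-2}$; the helper names `gen_fib` and `binet` are hypothetical.

```python
import math

PHI = (1 + math.sqrt(5)) / 2
PSI = (1 - math.sqrt(5)) / 2


def gen_fib(a, b, count):
    """First `count` terms of Fib(a, b), with f_1 = a and f_2 = b."""
    seq = [a, b]
    while len(seq) < count:
        seq.append(seq[-1] + seq[-2])
    return seq[:count]


def binet(a, b, n):
    """Closed-form value of f_n (n >= 1) under the assumed generalized Binet formula."""
    num = (b - a * PSI) * PHI ** (n - 1) - (b - a * PHI) * PSI ** (n - 1)
    return num / math.sqrt(5)


if __name__ == "__main__":
    for a, b in [(1, 2), (4, 7), (6, 10)]:
        terms = gen_fib(a, b, 20)
        ok = all(round(binet(a, b, n)) == f for n, f in enumerate(terms, start=1))
        print((a, b), ok)
```

Since $|\psi| < 1$, the second term vanishes as $n$ grows, so $f_n$ grows like $\phi^{n}$; this exponential growth is what limits the number of dilation offsets inside a window to a logarithmic count.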

Figures (8)

Figure 1: (a) The MHSA. (b) A general sparse attention computation strategy using a sequence of sparse support sets, $\{\Omega_i\}_{i=1}^h$, where each set selects $|\Omega_i|<N^2$ entries of the attention matrix. (c) The generalized masking strategy of Fibottention, which controls the sparsity of each attention matrix ${A}_{i}$ through a dilated sequence, $(f_n)_n\subset \mathbb{N}$, and a fixed window size $w$ for each head. Elements on $f_1$ and $f_2$ occur exclusively in the Modified Wythoff variant.
Figure 2: Sample frames from the datasets used in our robotics experiments.
Figure 3: Inference costs of ViT-B, ViT-T, ConViT-B (Vanilla vs. Fibottention).
Figure 4: Test accuracy of ViT-B models for corrupted datasets CIFAR-10 C and CIFAR-100 C, trained with and without Fibottention.
Figure 5: Effect of different dilation sequences $(f_n)_{n\in\mathbb{N}}$ on CIFAR-10 and CIFAR-100, with a fixed window size $w_i = N/3$.
...and 3 more figures

Theorems & Definitions (6)

Lemma 1: Generalized Binet's Formula (Koshy, 2019)
proof: Proof of Lemma 1
Lemma 2
proof
Theorem 3
proof
