EViT: An Eagle Vision Transformer with Bi-Fovea Self-Attention

Yulong Shi; Mingwei Sun; Yongshuai Wang; Jiahao Ma; Zengqiang Chen

TL;DR

This paper addresses the efficiency and inductive-bias challenges of Vision Transformers by drawing inspiration from the eagle’s bi-fovea vision. It introduces Bi-Fovea Visual Interaction (BFVI), with Bi-Fovea Self-Attention (BFSA) and Bi-Fovea Feedforward Network (BFFN), embedded in Bionic Eagle Vision (BEV) blocks to form the Eagle Vision Transformer (EViT) family. The resulting pyramid backbones achieve strong performance across image classification, object detection, and semantic segmentation while maintaining computational efficiency, validated on ImageNet-1K, COCO, and ADE20K, along with robustness and transfer-learning assessments. The work demonstrates the practicality of biologically inspired, coarse-to-fine multi-scale processing for scalable vision models and lays groundwork for further efficiency and interpretability studies.

Abstract

Owing to advancements in deep learning technology, Vision Transformers (ViTs) have demonstrated impressive performance in various computer vision tasks. Nonetheless, ViTs still face some challenges, such as high computational complexity and the absence of desirable inductive biases. To alleviate these issues, the potential advantages of combining eagle vision with ViTs are explored. We summarize a Bi-Fovea Visual Interaction (BFVI) structure inspired by the unique physiological and visual characteristics of eagle eyes. A novel Bi-Fovea Self-Attention (BFSA) mechanism and Bi-Fovea Feedforward Network (BFFN) are proposed based on this structural design approach, which can be used to mimic the hierarchical and parallel information processing scheme of the biological visual cortex, enabling networks to learn feature representations of targets in a coarse-to-fine manner. Furthermore, a Bionic Eagle Vision (BEV) block is designed as the basic building unit based on the BFSA mechanism and BFFN. By stacking BEV blocks, a unified and efficient family of pyramid backbone networks called Eagle Vision Transformers (EViTs) is developed. Experimental results show that EViTs exhibit highly competitive performance in various computer vision tasks, such as image classification, object detection and semantic segmentation. Compared with other approaches, EViTs have significant advantages, especially in terms of performance and computational efficiency. Code is available at https://github.com/nkusyl/EViT.

Paper Structure (19 sections, 10 equations, 8 figures, 8 tables)

Introduction
Related Work
Biological Eagle Vision
Transformers for Vision
Approach
Overall Architecture
Bionic Eagle Vision Block
Bi-Fovea Self-Attention
Complexity Analysis of the BFSA
Bi-Fovea Feedforward Network
Architectural Variants of EViTs
Experiments
Image Classification on ImageNet-1k
Object Detection and Instance Segmentation
Semantic Segmentation on ADE20K
...and 4 more sections

Figures (8)

Figure 1: (a) The physiological structure of the bi-fovea in eagle eye; (b) The density distribution of the photoreceptor cells contained in the bi-fovea.
Figure 2: Comparison among the Top-1 accuracies achieved by the EViTs and other baselines on the ImageNet-1K dataset. Compared with their counterparts, the EViTs achieve better trade-offs among the number of parameters, the computational complexity and performance.
Figure 3: (a) The visual neural pathway of the eagle eyes; (b) The Bi-Fovea Visual Interaction (BFVI) structure.
Figure 4: Illustration of the EViT. The EViT is composed of a convolutional stem and a pyramid structure with four stages. Each stage includes a 2 × 2 convolution with a stride of 2 and multiple Bionic Eagle Vision (BEV) blocks. Each BEV block consists of three key components: a Convolutional Positional Embedding (CPE), a Bi-Fovea Self-Attention (BFSA) and a Bi-Fovea Feedforward Network (BFFN).
Figure 5: Illustration of the BFSA. The BFSA consists of a Shallow Fovea Attention (SFA) and a Deep Fovea Attention (DFA).
...and 3 more figures
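
The Figure 4 caption describes a four-stage pyramid in which each stage applies a 2 × 2 convolution with a stride of 2. As a rough illustration of how that shrinks the feature map, the sketch below applies the standard convolution output-size formula once per stage. Several things here are assumptions rather than details from the paper: the helper names `conv_out_size` and `stage_resolutions` are hypothetical, the convolutions are taken to be unpadded, the BEV blocks are assumed to preserve spatial resolution, and the stem output size of 112 is an arbitrary example value because the caption does not state the stem's stride.

```python
from typing import List


def conv_out_size(size: int, kernel: int = 2, stride: int = 2, padding: int = 0) -> int:
    """Standard convolution output-size formula along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_resolutions(stem_out: int, num_stages: int = 4) -> List[int]:
    """Spatial size after each stage's 2x2, stride-2 convolution.

    Assumes (not stated in the caption) that the BEV blocks inside a stage
    leave the spatial resolution unchanged.
    """
    sizes = []
    size = stem_out
    for _ in range(num_stages):
        size = conv_out_size(size)  # each stage halves the resolution
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    # 112 is a hypothetical stem output size chosen only for illustration.
    print(stage_resolutions(112))
```

Under these assumptions, each stage halves the side length of the feature map, which is the usual way a pyramid backbone trades spatial resolution for channel capacity in deeper stages.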
