FViT: A Focal Vision Transformer with Gabor Filter

Yulong Shi; Mingwei Sun; Yongshuai Wang; Zengqiang Chen

TL;DR

The paper tackles the high computational cost and local-detail limitations of self-attention in vision transformers by introducing a Learnable Gabor Filter (LGF), a convolution-based alternative to self-attention. The LGF is paired with a Dual-Path Feed Forward Network (DPFFN) inside a Bionic Focal Vision (BFV) block, and stacking BFV blocks yields the FViT pyramid backbone, which comes in four variants of increasing size. Across ImageNet-1K, COCO, and ADE20K, FViTs reach competitive or superior accuracy at favorable compute cost, and ablations show that both the DPFFN and the LGF contribute to representation quality and efficiency. The work suggests that biologically inspired processing and learnable Gabor-like responses can yield strong, resource-efficient vision transformers suitable for dense prediction tasks.

Abstract

Vision transformers have achieved encouraging progress in various computer vision tasks. A common belief is that this is attributed to the capability of self-attention in modeling the global dependencies among feature tokens. However, self-attention still faces several challenges in dense prediction tasks, including high computational complexity and absence of desirable inductive bias. To alleviate these issues, the potential advantages of combining vision transformers with Gabor filters are revisited, and a learnable Gabor filter (LGF) using convolution is proposed. The LGF does not rely on self-attention, and it is used to simulate the response of fundamental cells in the biological visual system to the input images. This encourages vision transformers to focus on discriminative feature representations of targets across different scales and orientations. In addition, a Bionic Focal Vision (BFV) block is designed based on the LGF. This block draws inspiration from neuroscience and introduces a Dual-Path Feed Forward Network (DPFFN) to emulate the parallel and cascaded information processing scheme of the biological visual cortex. Furthermore, a unified and efficient family of pyramid backbone networks called Focal Vision Transformers (FViTs) is developed by stacking BFV blocks. Experimental results indicate that FViTs demonstrate superior performance in various vision tasks. In terms of computational efficiency and scalability, FViTs show significant advantages compared with other counterparts.
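To make the idea of a Gabor filter applied by convolution concrete, the following is a minimal sketch in plain Python. It uses the standard real-valued 2D Gabor function (a Gaussian envelope multiplied by a cosine carrier) and is not the authors' LGF implementation. The function names `gabor_kernel` and `correlate_valid` and the parameter names are assumptions made for illustration, and this summary does not state which parameters the paper actually learns. In a learnable variant, values such as `wavelength`, `theta`, and `sigma` would be trainable and updated by backpropagation rather than fixed.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=1.0, psi=0.0):
    """Build a size x size real Gabor kernel (size is assumed to be odd).

    Hypothetical helper for illustration; parameters follow the textbook
    Gabor function, not necessarily the paper's parameterization.
    """
    half = size // 2
    kernel = []
    for row in range(size):
        y = row - half
        values = []
        for col in range(size):
            x = col - half
            # Rotate coordinates by the filter orientation theta.
            x_rot = x * math.cos(theta) + y * math.sin(theta)
            y_rot = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope with aspect ratio gamma.
            envelope = math.exp(
                -(x_rot ** 2 + (gamma * y_rot) ** 2) / (2 * sigma ** 2)
            )
            # Sinusoidal carrier with the given wavelength and phase psi.
            carrier = math.cos(2 * math.pi * x_rot / wavelength + psi)
            values.append(envelope * carrier)
        kernel.append(values)
    return kernel


def correlate_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(k)
                for j in range(k)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]
```

Building several kernels with different `theta` and `wavelength` values and applying each to the same feature map gives responses at multiple orientations and scales, which is the behavior the abstract attributes to the LGF.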

Paper Structure (16 sections, 9 equations, 8 figures, 5 tables, 1 algorithm)

This paper contains 16 sections, 9 equations, 8 figures, 5 tables, 1 algorithm.

Introduction
Related Work
Transformers for Vision
2D Gabor Filter
Approach
Overall Architecture
Bionic Focal Vision Block
Learnable Gabor Filter
Dual-Path Feed Forward Network
Architecture Variants of FViTs
Experiments
Image Classification on ImageNet-1k
Object Detection and Instance Segmentation
Semantic Segmentation on ADE20K
Ablation Studies
...and 1 more section

Figures (8)

Figure 1: Comparison of Top-1 accuracies achieved by the FViTs and other baselines on the ImageNet-1K dataset. Compared with other counterparts, FViTs achieve a better trade-off in terms of parameters, computational complexity and performance on the ImageNet-1K classification task.
Figure 2: Overall architecture of the Focal Vision Transformer (FViT). FViT is composed of a convolutional stem and a pyramid structure with four stages. Each stage consists of a $2 \times 2$ convolution with stride 2 and multiple Bionic Focal Vision (BFV) blocks. The BFV block consists of a Convolutional Positional Embedding (CPE), a Learnable Gabor Filter (LGF) and a Dual-Path Feed Forward Network (DPFFN).
Figure 3: Illustration of the Learnable Gabor Filter (LGF).
Figure 4: Illustration of the forward computation process and gradient backpropagation process of LGF.
Figure 5: Visualization of the Dual-Path Feed Forward Network (DPFFN).
...and 3 more figures
