Cross-Layer Feature Pyramid Transformer for Small Object Detection in Aerial Images

Zewen Du; Zhenjiang Hu; Guiyu Zhao; Ying Jin; Hongbin Ma

TL;DR

This work tackles the challenge of small object detection in aerial imagery by introducing CFPT, an upsampler-free feature pyramid transformer that enables cross-layer interactions in a single step. CFPT employs two linear-complexity attention blocks, Cross-layer Channel-wise Attention (CCA) and Cross-layer Spatial-wise Attention (CSA), along with Cross-layer Consistent Relative Positional Encoding (CCPE) to preserve spatial and channel relations across layers. Extensive experiments on VisDrone2019-DET, TinyPerson, and xView show that CFPT consistently improves detection accuracy while reducing computational cost compared to state-of-the-art feature pyramids and detectors. The proposed components collectively enable better global contextual modeling and shallow-feature emphasis, making CFPT an efficient and effective neck for small object detection in aerial imagery.

Abstract

Object detection in aerial images has always been a challenging task due to the generally small size of the objects. Most current detectors prioritize the development of new detection frameworks, often overlooking research on fundamental components such as feature pyramid networks. In this paper, we introduce the Cross-Layer Feature Pyramid Transformer (CFPT), a novel upsampler-free feature pyramid network designed specifically for small object detection in aerial images. CFPT incorporates two meticulously designed attention blocks with linear computational complexity: Cross-Layer Channel-Wise Attention (CCA) and Cross-Layer Spatial-Wise Attention (CSA). CCA achieves cross-layer interaction by dividing channel-wise token groups to perceive cross-layer global information along the spatial dimension, while CSA enables cross-layer interaction by dividing spatial-wise token groups to perceive cross-layer global information along the channel dimension. By integrating these modules, CFPT enables efficient cross-layer interaction in a single step, thereby avoiding the semantic gap and information loss associated with element-wise summation and layer-by-layer transmission. In addition, CFPT incorporates global contextual information, which improves detection performance for small objects. To further enhance location awareness during cross-layer interaction, we propose the Cross-Layer Consistent Relative Positional Encoding (CCPE) based on inter-layer mutual receptive fields. We evaluate the effectiveness of CFPT on three challenging object detection datasets in aerial images: VisDrone2019-DET, TinyPerson, and xView. Extensive experiments demonstrate that CFPT outperforms state-of-the-art feature pyramid networks while incurring lower computational costs. The code is available at https://github.com/duzw9311/CFPT.
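The grouping idea behind the spatial-wise block can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes that a "spatial-wise token group" is the set of co-located windows across pyramid levels (for example 4x4, 2x2, and 1x1 tokens covering the same image region), and it omits learned projections, multiple heads, and CCPE. The function names `group_tokens`, `ungroup_tokens`, and `cross_layer_spatial_attention` are hypothetical.

```python
import numpy as np


def group_tokens(feat, patch):
    """Split a (C, H, W) map into non-overlapping patch x patch windows.

    Returns an array of shape (num_windows, patch * patch, C).
    """
    c, h, w = feat.shape
    gh, gw = h // patch, w // patch
    x = feat.reshape(c, gh, patch, gw, patch)
    x = x.transpose(1, 3, 2, 4, 0)  # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch, c)


def ungroup_tokens(tokens, patch, h, w):
    """Inverse of group_tokens: (num_windows, patch * patch, C) -> (C, H, W)."""
    c = tokens.shape[-1]
    gh, gw = h // patch, w // patch
    x = tokens.reshape(gh, gw, patch, patch, c)
    x = x.transpose(4, 0, 2, 1, 3)  # (C, gh, patch, gw, patch)
    return x.reshape(c, h, w)


def cross_layer_spatial_attention(feats):
    """Attend jointly over co-located tokens from every pyramid level.

    feats: list of (C, H_l, W_l) arrays, finest first; every level is assumed
    to be an integer multiple of the coarsest level's resolution.
    """
    coarse_h = feats[-1].shape[1]
    # Window size per level so that all levels yield the same number of groups.
    patches = [f.shape[1] // coarse_h for f in feats]
    groups = [group_tokens(f, p) for f, p in zip(feats, patches)]
    tokens = np.concatenate(groups, axis=1)  # (G, sum of patch^2, C)

    # Plain scaled dot-product self-attention inside each group
    # (identity projections, an assumption made to keep the sketch short).
    c = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(c)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    mixed = weights @ tokens

    # Split the mixed tokens back into one map per level.
    outs, start = [], 0
    for f, p in zip(feats, patches):
        n = p * p
        outs.append(ungroup_tokens(mixed[:, start:start + n], p, f.shape[1], f.shape[2]))
        start += n
    return outs


rng = np.random.default_rng(0)
pyramid = [rng.normal(size=(8, s, s)) for s in (8, 4, 2)]
updated = cross_layer_spatial_attention(pyramid)
```

In this sketch each group holds a fixed number of tokens (16 + 4 + 1 = 21 for three levels), and the number of groups grows with the image area, so the attention cost grows linearly with resolution. Every level is updated in one pass and no upsampling is involved.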

Paper Structure (33 sections, 13 equations, 12 figures, 9 tables)

Introduction
Related Work
Small Object Detection in Aerial Images
Feature Pyramid Network
Vision Transformer
Methodology
Overview
Cross-layer Channel-wise Attention
Cross-layer Spatial-wise Attention
Cross-layer Consistent Relative Positional Encoding
Complexity Analysis
Cross-layer Channel-wise Attention
Cross-layer Spatial-wise Attention
Experiments
Datasets
...and 18 more sections

Figures (12)

Figure 1: Box plots of the scale distribution for (a) the VisDrone2019-DET dataset [du2019visdrone] and (b) the TinyPerson dataset [yu2020scale]. The ordinate shows the category of the annotated bounding boxes, and the abscissa shows the square root of their area (i.e., $\sqrt{W\times H}$). For clarity, outliers beyond $1.5\times$ the interquartile range (IQR) are removed.
Figure 2: Comparison of the structures and visual feature maps of various feature pyramid networks, including FPN [lin2017feature], PAFPN [liu2018path], AFPN [yang2023afpn], and our CFPT. "Baseline" refers to RetinaNet [lin2017focal] without a feature pyramid network (i.e., using vanilla convolutional layers to generate multi-scale feature maps). Red rectangles mark the ground truths in the current image. CFPT focuses effectively on multi-scale objects, including small ones, whereas AFPN tends to prioritize larger objects and overlook smaller ones. "Down" and "Up" denote downsampling and upsampling operations, respectively. Note that CFPT does not involve upsampling. Best viewed in color and zoomed in for clarity.
Figure 3: Performance comparison of various state-of-the-art feature pyramid networks on the VisDrone2019-DET dataset. Each network is evaluated by substituting it for the neck of RetinaNet [lin2017focal].
Figure 4: Overall architecture of the proposed Cross-layer Feature Pyramid Transformer (CFPT). Given an input image of shape $H\times W \times 3$, we apply Cross-layer Channel-wise Attention (CCA) and Cross-layer Spatial-wise Attention (CSA) multiple times to feature maps downsampled by factors of 8, 16, 32, and 64 to capture cross-layer global contextual information and perform cross-layer adaptive feature correction.
Figure 5: Illustration of cross-layer neighboring interactions, where blocks of the same color represent tokens at different layers that need to interact. The dashed coordinate system represents the hidden mixed direction of the feature map.
...and 7 more figures
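
The scale measure and outlier filter described in the Figure 1 caption can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the authors' plotting code: it assumes the usual Tukey fences (values below Q1 - 1.5 IQR or above Q3 + 1.5 IQR are dropped) and inclusive quartiles. The function names `box_scales` and `drop_iqr_outliers` and the sample boxes are hypothetical.

```python
import math
import statistics


def box_scales(widths, heights):
    """Scale of each bounding box as sqrt(W * H)."""
    return [math.sqrt(w * h) for w, h in zip(widths, heights)]


def drop_iqr_outliers(values, k=1.5):
    """Keep values inside [Q1 - k * IQR, Q3 + k * IQR] (assumed Tukey fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Made-up box sizes in pixels, for illustration only.
widths = [4, 9, 16, 25, 400]
heights = [4, 9, 16, 25, 400]
scales = box_scales(widths, heights)
kept = drop_iqr_outliers(scales)
```

For these sample boxes the quartiles are 9 and 25, so the fences are -15 and 49 and the 400-pixel box is dropped before plotting.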
