A DeNoising FPN With Transformer R-CNN for Tiny Object Detection

Hou-I Liu; Yu-Wen Tseng; Kai-Cheng Chang; Pin-Jyun Wang; Hong-Han Shuai; Wen-Huang Cheng

TL;DR

This work tackles the challenge of tiny object detection in aerial imagery by introducing DNTR, a framework that combines De-Noising FPN (DN-FPN) with geometric-semantic contrastive learning and a Transformer-based two-stage detector (Trans R-CNN). DN-FPN suppresses noise during FPN fusion, while Trans R-CNN enhances local and global representations within RoIs through shuffle unfolding and a Mask Transformer Encoder. The method achieves substantial improvements on AI-TOD (AP$_{vt}$ gains) and VisDrone, with competitive performance on COCO, and demonstrates that DN-FPN is a versatile plug-in for other detectors. Overall, DNTR sets a new benchmark for tiny object detection, highlighting the potential of combining denoising feature fusion with transformer-based RoI processing in remote sensing applications.

Abstract

Despite notable advances in computer vision, precise detection of tiny objects remains a significant challenge, largely because these objects occupy very few pixels in the image. This challenge is especially relevant to geoscience and remote sensing, where high-fidelity detection of tiny objects supports applications ranging from urban planning to environmental monitoring. In this paper, we propose a new framework, DeNoising FPN with Trans R-CNN (DNTR), to improve tiny object detection. DNTR consists of an easy plug-in design, DeNoising FPN (DN-FPN), and an effective Transformer-based detector, Trans R-CNN. First, feature fusion in the feature pyramid network (FPN) is important for detecting multiscale objects, but the fusion process may produce noisy features because there is no regularization between features of different scales. We therefore introduce a DN-FPN module that uses contrastive learning to suppress noise in each level's features in the top-down path of the FPN. Second, based on the two-stage framework, we replace the obsolete R-CNN detector with a novel Trans R-CNN detector that focuses on the representation of tiny objects with self-attention. Experimental results show that DNTR outperforms the baselines by at least 17.4% in AP$_{vt}$ on the AI-TOD dataset and 9.6% in AP on the VisDrone dataset. Our code will be available at https://github.com/hoiliu-0801/DNTR.
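To make the contrastive idea concrete, here is a minimal sketch of a generic InfoNCE-style loss for a single query feature. It is an illustration only: the abstract does not spell out the loss, so the InfoNCE form, the function names (`info_nce`, `l2_normalize`, `dot`), and the `temperature` value are all assumptions, not the authors' implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v] if norm > 0 else list(v)


def info_nce(query, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one query (assumed form, not the paper's exact loss).

    The loss is small when the query is close to its positive sample and
    far from every negative sample, measured by cosine similarity.
    """
    q = l2_normalize(query)
    # Index 0 holds the positive logit; the remaining entries are negatives.
    logits = [dot(q, l2_normalize(positive)) / temperature]
    logits += [dot(q, l2_normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


if __name__ == "__main__":
    aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
    mismatched = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
    print(f"aligned positive: {aligned:.6f}, mismatched positive: {mismatched:.6f}")
```

In this sketch, a query that matches its positive sample yields a loss near zero, while a query that matches one of the negatives is penalized heavily. Minimizing such a loss is how a contrastive objective pulls related features together and pushes unrelated ones apart.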

Paper Structure

This paper contains 23 sections, 12 equations, 9 figures, 11 tables, and 1 algorithm.

Introduction
Method
DN-FPN
Trans R-CNN
Overall Objective
Experiments
Experimental Setup
Dataset
Implementation Details
Evaluation Metrics
Comparisons with State-of-the-Art Methods
Experiment on AI-TOD
Experiments on VisDrone
Experiments on COCO
Ablation Studies
...and 8 more sections

Figures (9)

Figure 1: Conventional FPN structure. Following FPN, the fusion features combine the lateral features with the upper-level features. The main purpose is to aggregate the geometric and semantic information from the low-resolution and high-resolution features to obtain better multiscale features. However, the channel reduction (1x1 Conv.) and upsampling (Up. x2) cause noise and damage the geometric and semantic information in FPN, respectively. Conv. and Up. denote the convolution layer and the upsampling operation.
Figure 2: Comparison of different object detection frameworks. (a) A CNN-based two-stage detector generates the RoIs (three different shades of blue) with an RPN and applies an R-CNN head to predict objects. (b) A DETR-like model flattens the visual feature into image patches and passes them through the transformer decoder via cross-attention to transform the object queries into the final bounding boxes. (c) Our DNTR extracts less noisy multiscale features with the DN-FPN module. Subsequently, shuffle unfolding and Trans R-CNN are employed to capture local and global information within an RoI, resulting in better detection outcomes for tiny objects.
Figure 3: Overall architecture of DNTR. We use $C_{i}$ and $P_{i}$ to represent the multiscale features from the backbone and DN-FPN (Eq. 1), respectively, where $i$ denotes the level of the multiscale feature. (a) DN-FPN module, which suppresses redundant features by geometric-semantic contrastive learning. Geo. and Sem. denote geometric and semantic, respectively. (b) Trans R-CNN head, which aims to utilize the surrounding information and capture rich long-range dependencies within an RoI.
Figure 4: An illustration of the geometric and semantic relations of DN-FPN. We take $g^p_{0,1}$ and $s^p_{0,1}$ as the example queries (black frames) of the geometric and semantic representations of the lowest fusion feature $P_{0,1}$. The geometric and semantic relations of positive and negative samples are shown in the light blue and light yellow oval regions, respectively. Note that the queries and the positive and negative samples are computed independently for the geometric and semantic relations, highlighted by solid and cross frames. In addition, "Other" denotes the representations that are not included in the loss function.
Figure 5: The structure of Trans R-CNN, which is composed of shuffle unfolding, a mask transformer encoder (MTE), and a task token selection mechanism. $G_c$ and $G_b$ denote the class-related and box-related groups.
...and 4 more figures
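
The conventional top-down fusion step described in the Figure 1 caption can be sketched in a few lines of plain Python. This is a rough illustration under stated assumptions: feature maps are nested lists shaped `[C][H][W]`, the lateral and upsampled features are combined by element-wise addition (as in a standard FPN), upsampling is nearest-neighbour, and the helper names `conv1x1`, `upsample2x`, and `fuse` are hypothetical rather than taken from the DNTR code.

```python
def conv1x1(feature, weight):
    """1x1 convolution (channel reduction).

    feature: nested list shaped [C_in][H][W]
    weight:  nested list shaped [C_out][C_in]
    Returns a nested list shaped [C_out][H][W].
    """
    c_in = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    return [
        [[sum(w_k[c] * feature[c][i][j] for c in range(c_in))
          for j in range(width)]
         for i in range(height)]
        for w_k in weight
    ]


def upsample2x(feature):
    """Nearest-neighbour x2 upsampling of a [C][H][W] feature map."""
    return [
        [[row[j // 2] for j in range(2 * len(row))]
         for row in channel
         for _ in range(2)]  # repeat every row twice
        for channel in feature
    ]


def fuse(backbone_feature, upper_feature, lateral_weight):
    """One top-down step: 1x1 lateral conv + upsampled upper level (assumed sum)."""
    lateral = conv1x1(backbone_feature, lateral_weight)
    upsampled = upsample2x(upper_feature)
    return [
        [[a + b for a, b in zip(row_l, row_u)]
         for row_l, row_u in zip(chan_l, chan_u)]
        for chan_l, chan_u in zip(lateral, upsampled)
    ]


if __name__ == "__main__":
    backbone = [[[1, 2], [3, 4]], [[10, 20], [30, 40]]]  # 2 channels, 2x2
    upper = [[[100]]]                                     # 1 channel, 1x1
    weight = [[1.0, 0.5]]                                 # reduce 2 channels to 1
    print(fuse(backbone, upper, weight))
```

Both operations named in the caption are visible here: the 1x1 convolution mixes channels down to a smaller number, and the x2 upsampling copies each coarse value into a 2x2 block. Neither step is regularized against the other scale, which is the gap the caption points to as a source of noise.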
