Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning

Yan Li; Weiwei Guo; Xue Yang; Ning Liao; Dunyun He; Jiaqi Zhou; Wenxian Yu

TL;DR

This paper proposes CastDet, a CLIP-activated student-teacher open-vocabulary object Detection framework that employs the RemoteCLIP model as an extra omniscient teacher with rich knowledge for open-vocabulary object detection in aerial images.

Abstract

An increasingly massive number of remote-sensing images spurs the development of extensible object detectors that can detect objects beyond the training categories without the costly collection of new labeled data. In this paper, we aim to develop an open-vocabulary object detection (OVD) technique for aerial images that scales the object vocabulary beyond the training data. The performance of OVD relies heavily on the quality of class-agnostic region proposals and of pseudo-labels for novel object categories. To generate high-quality proposals and pseudo-labels simultaneously, we propose CastDet, a CLIP-activated student-teacher open-vocabulary object Detection framework. Our end-to-end framework follows the student-teacher self-learning mechanism and employs the RemoteCLIP model as an extra omniscient teacher with rich knowledge. By doing so, our approach improves not only novel object proposals but also their classification. Furthermore, we devise a dynamic label queue strategy to maintain high-quality pseudo-labels during batch training. We conduct extensive experiments on multiple existing aerial object detection datasets, which are set up for the OVD task. Experimental results demonstrate that CastDet achieves superior open-vocabulary detection performance, e.g., reaching 46.5% mAP on VisDroneZSD novel categories, which outperforms the state-of-the-art open-vocabulary detectors by 21.0% mAP. To the best of our knowledge, this is the first work to apply and develop the open-vocabulary object detection technique for aerial images. The code is available at https://github.com/lizzy8587/CastDet.
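As a rough illustration of how a CLIP-style model such as RemoteCLIP can act as an external teacher, the sketch below assigns a pseudo-label to a cropped proposal by comparing its image embedding with one text embedding per class name. It is not the CastDet implementation: the function names, the toy 3-dimensional embeddings, the `temperature` value, and the `min_confidence` threshold are all assumptions made for illustration, and a real system would obtain the embeddings from the RemoteCLIP encoders.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_label(crop_embedding, text_embeddings, class_names,
                 temperature=0.01, min_confidence=0.8):
    """Return (class_name, confidence), or None if the match is too uncertain.

    temperature and min_confidence are hypothetical values, not taken from the paper.
    """
    img = l2_normalize(crop_embedding)
    logits = []
    for txt in text_embeddings:
        t = l2_normalize(txt)
        cosine = sum(a * b for a, b in zip(img, t))
        logits.append(cosine / temperature)
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return None  # discard low-confidence pseudo-labels
    return class_names[best], probs[best]

# Toy stand-ins for text embeddings of three class prompts (assumed data).
class_names = ["airplane", "ship", "bridge"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

confident = pseudo_label([0.9, 0.1, 0.0], text_embeddings, class_names)
ambiguous = pseudo_label([1.0, 1.0, 0.0], text_embeddings, class_names)
```

In this toy setup the first crop is labeled "airplane", while the second is rejected because it is equally similar to two classes. Thresholding on confidence is one simple way to keep only high-quality pseudo-labels.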

Paper Structure (15 sections, 9 equations, 5 figures, 3 tables)

This paper contains 15 sections, 9 equations, 5 figures, 3 tables.

Introduction
Related Work
Methodology
Problem Description
Open Vocabulary Object Detector
Localization Teacher
Dynamic Pseudo Label Queue
Hybrid Training
Experiments
Datasets and Settings
Ablation Study
Comparison with the RemoteCLIP
Comparison with the State-of-the-Art
Evaluation on Other Dataset
Conclusion

Figures (5)

Figure 1: Comparison of target categories and the number of images for 18 common aerial and natural image datasets. Challenge 1: Aerial datasets are much smaller in size and category vocabulary than natural image datasets.
Figure 2: Challenge 2: The recall on aerial images is much lower than on natural images. (a)(b) Aerial images from DIOR [dior_li2020object]. Objects in aerial images exhibit background interference. (c) Class-agnostic RPN recall statistics of novel categories in the natural dataset COCO [lin2014microsoft] and the aerial dataset VisDroneZSD [VisDrone2023] (i.e., 77% vs. 48%).
Figure 3: Overall architecture of CastDet. In each training iteration, the data batch consists of three data flows: labeled data with annotations, unlabeled data, and data sampled from the dynamic label queue. The labeled images are directly used for the student network training ($\mathcal{L}_s$), while two sets of pseudo-labels of unlabeled data are predicted through the localization teacher and external teacher. One supervises the student ($\mathcal{L}_u$), and the other is pushed into the dynamic label queue. Simultaneously, samples are randomly selected from the dynamic label queue to enhance the student's ability to detect novel targets ($\mathcal{L}_d$).
Figure 4: Visualization of three types of box selection strategies. The figures show the correlation among IoU, classification score, and (a) RPN score, (b) box-jittering variance, and (c) regression-jittering variance, respectively. Among them, IoU is represented by the color bar.
Figure 5: Workflow of dynamic label queue. Step 1: filter certain high-quality proposal boxes generated by the localization teacher, and employ RemoteCLIP to classify corresponding crop images as pseudo-labels. Step 2: dynamically update those pseudo-labels into the queue, and randomly sample a batch of pseudo-labels for the student training.
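The two-step workflow described for the dynamic label queue can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `DynamicLabelQueue`, the bounded first-in-first-out eviction policy, the `min_score` filter, and the record fields are all hypothetical choices, since the captions do not specify how the queue is sized or updated.

```python
import random
from collections import deque

class DynamicLabelQueue:
    """Bounded store of pseudo-labels; oldest entries are evicted first (assumed policy)."""

    def __init__(self, max_size=1000, seed=None):
        self._items = deque(maxlen=max_size)  # deque drops the oldest item when full
        self._rng = random.Random(seed)

    def push(self, image_id, box, label, score, min_score=0.8):
        """Step 1 + 2a: keep only confident pseudo-labels and add them to the queue."""
        if score < min_score:
            return False  # filtered out as low quality
        self._items.append(
            {"image_id": image_id, "box": box, "label": label, "score": score}
        )
        return True

    def sample(self, batch_size):
        """Step 2b: draw a random batch (without replacement) for student training."""
        k = min(batch_size, len(self._items))
        return self._rng.sample(list(self._items), k)

    def __len__(self):
        return len(self._items)

# Usage with made-up pseudo-labels.
queue = DynamicLabelQueue(max_size=3, seed=0)
for i in range(5):
    queue.push(image_id=i, box=(0, 0, 10, 10), label="airplane", score=0.9)
rejected = queue.push(image_id=99, box=(0, 0, 5, 5), label="ship", score=0.1)
batch = queue.sample(2)
```

A bounded queue keeps memory use fixed and lets newer pseudo-labels gradually replace older ones as training progresses. Other update rules (for example, per-class queues or score-based replacement) would fit the same interface.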
