Toward Open Vocabulary Aerial Object Detection with CLIP-Activated Student-Teacher Learning

Yan Li, Weiwei Guo, Xue Yang, Ning Liao, Dunyun He, Jiaqi Zhou, Wenxian Yu

TL;DR

This paper proposes CastDet, a CLIP-activated student-teacher open-vocabulary object Detection framework that employs the RemoteCLIP model as an extra omniscient teacher with rich knowledge for open-vocabulary object detection in aerial images.

Abstract

The rapidly growing volume of remote-sensing images spurs the development of extensible object detectors that can detect objects beyond the training categories without the cost of collecting new labeled data. In this paper, we aim to develop an open-vocabulary object detection (OVD) technique for aerial images that scales the object vocabulary beyond the training data. The performance of OVD relies heavily on the quality of class-agnostic region proposals and pseudo-labels for novel object categories. To simultaneously generate high-quality proposals and pseudo-labels, we propose CastDet, a CLIP-activated student-teacher open-vocabulary object Detection framework. Our end-to-end framework follows the student-teacher self-learning mechanism and employs the RemoteCLIP model as an extra omniscient teacher with rich knowledge. By doing so, our approach improves not only novel object proposals but also their classification. Furthermore, we devise a dynamic label queue strategy to maintain high-quality pseudo-labels during batch training. We conduct extensive experiments on multiple existing aerial object detection datasets, which are set up for the OVD task. Experimental results demonstrate that CastDet achieves superior open-vocabulary detection performance, e.g., reaching 46.5% mAP on VisDroneZSD novel categories, which outperforms the state-of-the-art open-vocabulary detectors by 21.0% mAP. To the best of our knowledge, this is the first work to apply and develop the open-vocabulary object detection technique for aerial images. The code is available at https://github.com/lizzy8587/CastDet.

Paper Structure (15 sections, 9 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Comparison of target categories and the number of images for 18 common aerial and natural image datasets. Challenge 1: Aerial datasets are much smaller than natural image datasets in both size and category vocabulary.
  • Figure 2: Challenge 2: The recall of aerial images is much lower than that of natural images. (a)(b) Aerial images from DIOR [dior_li2020object]. Objects in aerial images are subject to background interference. (c) Class-agnostic RPN recall statistics of novel categories in the natural dataset COCO [lin2014microsoft] and the aerial dataset VisDroneZSD [VisDrone2023] (77% vs. 48%).
  • Figure 3: Overall architecture of CastDet. In each training iteration, the data batch consists of three data flows: labeled data with annotations, unlabeled data, and data sampled from the dynamic label queue. The labeled images are directly used for the student network training ($\mathcal{L}_s$), while two sets of pseudo-labels of unlabeled data are predicted through the localization teacher and external teacher. One supervises the student ($\mathcal{L}_u$), and the other is pushed into the dynamic label queue. Simultaneously, samples are randomly selected from the dynamic label queue to enhance the student's ability to detect novel targets ($\mathcal{L}_d$).
  • Figure 4: Visualization of three types of box selection strategies. The figure shows the correlation among IoU, classification score, and (a) RPN score, (b) box-jittering variance, and (c) regression-jittering variance, respectively. IoU is represented by the color bar.
  • Figure 5: Workflow of the dynamic label queue. Step 1: filter high-quality proposal boxes generated by the localization teacher, and employ RemoteCLIP to classify the corresponding image crops to obtain pseudo-labels. Step 2: dynamically push those pseudo-labels into the queue, and randomly sample a batch of pseudo-labels for student training.
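
The two-step workflow in the Figure 5 caption can be illustrated with a small sketch. This is not the authors' implementation: the names `PseudoLabel` and `DynamicLabelQueue`, the `min_score` threshold, and the fixed-size FIFO eviction policy are all assumptions made for illustration. The localization teacher and RemoteCLIP are left out; the sketch assumes each pseudo-label arrives with a category and a quality score already attached.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class PseudoLabel:
    """Hypothetical record for one pseudo-labeled box."""
    image_id: str
    box: tuple      # (x1, y1, x2, y2)
    category: str   # label assigned upstream (e.g. by an external classifier)
    score: float    # quality score assigned upstream (assumed to exist)


class DynamicLabelQueue:
    """Bounded FIFO store of pseudo-labels; oldest entries are evicted first.

    The eviction policy and the score filter are assumptions, not the
    paper's specification.
    """

    def __init__(self, max_size=1024, min_score=0.8, seed=None):
        self.min_score = min_score
        self._queue = deque(maxlen=max_size)
        self._rng = random.Random(seed)

    def push(self, labels):
        """Step 1 and 2: filter by score, then append; return how many passed."""
        kept = [lb for lb in labels if lb.score >= self.min_score]
        self._queue.extend(kept)  # deque drops the oldest items when full
        return len(kept)

    def sample(self, batch_size):
        """Step 2: draw up to `batch_size` labels at random, without replacement."""
        k = min(batch_size, len(self._queue))
        return self._rng.sample(list(self._queue), k)

    def __len__(self):
        return len(self._queue)
```

In a training loop, `push` would be called once per iteration with the newly generated pseudo-labels, and `sample` would supply the extra batch used to supervise the student. A bounded queue keeps memory constant and gradually replaces stale pseudo-labels with ones produced by the more recent teacher.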