Exploring Robust Features for Few-Shot Object Detection in Satellite Imagery

Xavier Bou, Gabriele Facciolo, Rafael Grompone von Gioi, Jean-Michel Morel, Thibaud Ehret

TL;DR

This paper presents a few-shot object detector based on a traditional two-stage architecture in which the classification block is replaced by a prototype-based classifier. The detector outperforms fully supervised and few-shot methods evaluated on the SIMD and DIOR datasets, despite having very few trainable parameters.

Abstract

The goal of this paper is to perform object detection in satellite imagery with only a few examples, thus enabling users to specify any object class with minimal annotation. To this end, we explore recent methods and ideas from open-vocabulary detection for the remote sensing domain. We develop a few-shot object detector based on a traditional two-stage architecture, where the classification block is replaced by a prototype-based classifier. A large-scale pre-trained model is used to build class-reference embeddings or prototypes, which are compared to region proposal contents for label prediction. In addition, we propose to fine-tune prototypes on available training images to boost performance and learn differences between similar classes, such as aircraft types. We perform extensive evaluations on two remote sensing datasets containing challenging and rare objects. Moreover, we study the performance of both visual and image-text features, namely DINOv2 and CLIP, including two CLIP models specifically tailored for remote sensing applications. Results indicate that visual features are largely superior to vision-language models, as the latter lack the necessary domain-specific vocabulary. Lastly, the developed detector outperforms fully supervised and few-shot methods evaluated on the SIMD and DIOR datasets, despite minimal training parameters.

Paper Structure (13 sections, 5 equations, 5 figures, 6 tables)

This paper contains 13 sections, 5 equations, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: Performance (mAP) of the proposed detector with DINOv2 features on the SIMD dataset, compared to YOLOv5 for different amounts of available examples per class. Robust visual features largely outperform state-of-the-art supervised methods when annotated data is limited.
  • Figure 2: Building a class reference prototype for the aircraft category propeller with four examples. The frozen pre-trained backbone is used to extract image representations. Then, patches overlapping box annotations are averaged into one single vector. Lastly, all four embeddings are combined into a reference vector via averaging and normalization.
  • Figure 3: General diagram of our detector. An input image is fed to the RPN to generate region proposals, as well as to the backbone to extract high-level representations. Then, cosine similarity maps are generated using the features and pre-computed prototypes. For each region proposal, the mean similarity with each prototype is computed, and the proposal is assigned the class of the most similar prototype. Lastly, we discard boxes classified as a background prototype and apply non-maximum suppression.
  • Figure 4: Illustrative qualitative results obtained by the proposed detector. Images on the top row correspond to the SIMD dataset, while images on the bottom belong to the DIOR dataset.
  • Figure 5: t-SNE visualization of the learned prototypes for the SIMD dataset using $N=10$, before and after fine-tuning. Aircraft types are shown with a star marker, types of terrestrial vehicles with a cross marker, and the boat class with a diamond. Class separation increases after fine-tuning; for example, stair-truck and pushback-truck are more separable after training. In addition, classes within the same transportation group lie close together yet remain distinguishable, whereas the separation between different groups is more pronounced.
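
The prototype construction in Figure 2 and the proposal classification in Figure 3 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 2-D feature vectors, and the plain-list data layout are assumptions made for illustration. In the paper the features come from a frozen pre-trained backbone such as DINOv2, and proposals come from an RPN. The sketch also assumes that per-example embeddings are averaged before a single final normalization, and it omits non-maximum suppression.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_prototype(examples):
    """Build one class prototype (hypothetical helper, following Figure 2).

    `examples` holds one entry per annotated example; each entry is the list
    of patch features overlapping that example's box. Patches are averaged
    per example, then examples are averaged and the result is normalized.
    """
    per_example = [mean_vector(patches) for patches in examples]
    return l2_normalize(mean_vector(per_example))


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def classify_proposal(patch_features, prototypes):
    """Label one region proposal (hypothetical helper, following Figure 3).

    For each prototype, average the cosine similarity over the patches inside
    the proposal, then return the best-scoring class and its score.
    """
    scores = {
        name: sum(cosine(p, proto) for p in patch_features) / len(patch_features)
        for name, proto in prototypes.items()
    }
    label = max(scores, key=scores.get)
    return label, scores[label]


def detect(proposals, prototypes, background="background"):
    """Classify every (box, patch_features) proposal and drop background ones."""
    detections = []
    for box, patch_features in proposals:
        label, score = classify_proposal(patch_features, prototypes)
        if label != background:
            detections.append((box, label, score))
    return detections


# Toy data: 2-D "features" standing in for backbone patch embeddings.
prototypes = {
    "propeller": build_prototype([[[1.0, 0.0], [0.8, 0.2]], [[0.9, 0.1]]]),
    "background": build_prototype([[[0.0, 1.0]], [[0.1, 0.9]]]),
}
proposals = [
    ((10, 10, 50, 50), [[0.9, 0.1], [1.0, 0.0]]),  # resembles the propeller class
    ((60, 60, 90, 90), [[0.0, 1.0]]),              # resembles background
]
detections = detect(proposals, prototypes)
```

In this toy setup only the first proposal survives, since the second is assigned to the background prototype. A full pipeline would then apply non-maximum suppression to the remaining boxes, as described in Figure 3.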