Semantic Enhanced Few-shot Object Detection

Zheng Wang, Yingjie Gao, Qingjie Liu, Yunhong Wang

TL;DR

This work proposes a fine-tuning based FSOD framework that utilizes semantic embeddings for better detection and introduces a multimodal feature fusion to augment the vision-language communication, enabling a novel class to draw support explicitly from well-trained similar base classes.

Abstract

Few-shot object detection (FSOD), which aims to detect novel objects with limited annotated instances, has made significant progress in recent years. However, existing methods still suffer from biased representations, especially for novel classes in extremely low-shot scenarios. During fine-tuning, a novel class may exploit knowledge from similar base classes to construct its own feature distribution, leading to classification confusion and performance degradation. To address these challenges, we propose a fine-tuning based FSOD framework that utilizes semantic embeddings for better detection. In our proposed method, we align the visual features with class name embeddings and replace the linear classifier with our semantic similarity classifier. Our method trains each region proposal to converge to the corresponding class embedding. Furthermore, we introduce a multimodal feature fusion to augment the vision-language communication, enabling a novel class to draw support explicitly from well-trained similar base classes. To prevent class confusion, we propose a semantic-aware max-margin loss, which adaptively applies a margin beyond similar classes. As a result, our method allows each novel class to construct a compact feature space without being confused with similar base classes. Extensive experiments on Pascal VOC and MS COCO demonstrate the superiority of our method.
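
The abstract does not give the exact formulation, so the following is only a minimal sketch of what a similarity-based classifier with a margin term could look like, not the authors' implementation. It assumes that RoI features have already been projected to the dimension of the class name embeddings, that similarity is cosine similarity, and that the margin is an additive term weighted by the similarity between class embeddings. The names `semantic_logits`, `margin_cross_entropy`, `scale`, and `margin` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_logits(roi_feature, class_embeddings, scale=10.0):
    """Score one RoI feature against every class name embedding.

    Assumption: roi_feature is already in the embedding space;
    `scale` is a hypothetical temperature, not a value from the paper.
    """
    return [scale * cosine(roi_feature, e) for e in class_embeddings]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [x / total for x in exps]


def margin_cross_entropy(roi_feature, class_embeddings, target,
                         margin=0.5, scale=10.0):
    """Cross-entropy with a hypothetical similarity-weighted margin.

    Assumption (not the paper's formula): the logit of each wrong
    class j is raised in proportion to how similar its embedding is
    to the target class embedding, so the RoI feature must beat
    semantically close classes by a larger gap.
    """
    logits = semantic_logits(roi_feature, class_embeddings, scale)
    for j, embedding in enumerate(class_embeddings):
        if j != target:
            sim = max(0.0, cosine(class_embeddings[target], embedding))
            logits[j] += scale * margin * sim
    return -math.log(softmax(logits)[target])


# Toy 3-d "class name embeddings": cow and sheep are close, bus is far.
embeddings = [[1.0, 0.2, 0.0], [0.9, 0.4, 0.0], [0.0, 0.0, 1.0]]
roi = [1.0, 0.25, 0.05]
probabilities = softmax(semantic_logits(roi, embeddings))
```

In this toy setup the penalty added to the "sheep" logit is much larger than the one added to the "bus" logit, which mirrors the idea of applying the margin adaptively to similar classes. Setting `margin=0.0` reduces the loss to plain cross-entropy over the similarity scores.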

Paper Structure (13 sections, 8 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: The illustration of our main idea. a) The original feature space, where 'cow' is a novel class, 'sheep' and 'horse' are confusable base classes. b) The semantic alignment learning aligns visual space with semantic space by bringing RoI features closer to their class name embeddings. c) A max-margin loss is further proposed to push confusable classes away from each other.
  • Figure 2: The overview of our method. In the base training stage, we follow previous methods to train a linear classifier on the base set. In the novel fine-tuning stage, we initialize the model with base knowledge and replace the linear classifier with our semantic similarity classifier. Additionally, multimodal feature fusion is proposed to improve the vision-language communication. Finally, the classifier branch is optimized by our semantic-aware max-margin loss.
  • Figure 3: The confusion matrix on Pascal VOC Split1. Each element in column $m$ and row $n$ indicates the percentage of samples in class $m$ that are recognized as class $n$. If $m$ and $n$ stand for different classes, a high score would indicate severe confusion. Our approach alleviates the confusion between novel classes and similar base classes by a large margin.
  • Figure 4: Visualization of detection results on Pascal VOC Split1. Our method correctly detects objects that DeFRCN misses (left) or misclassifies (middle and right).
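
The layout described in the Figure 3 caption (column $m$ is the true class, row $n$ is the recognized class, entries are percentages) can be illustrated with a small sketch. It assumes each sample is already reduced to a pair of integer class indices (true, predicted); how detections are matched to ground-truth boxes is not specified here and is left out. The function name `confusion_percentages` is hypothetical.

```python
def confusion_percentages(true_labels, predicted_labels, num_classes):
    """Return matrix[n][m]: percentage of class-m samples recognized as class n.

    Columns index the true class and rows index the predicted class,
    following the convention in the Figure 3 caption.
    """
    counts = [[0] * num_classes for _ in range(num_classes)]
    for m, n in zip(true_labels, predicted_labels):
        counts[n][m] += 1

    # Number of samples whose true class is m (the column sums).
    column_totals = [
        sum(counts[n][m] for n in range(num_classes))
        for m in range(num_classes)
    ]

    # Normalize each column to percentages; empty classes stay at 0.
    return [
        [
            100.0 * counts[n][m] / column_totals[m] if column_totals[m] else 0.0
            for m in range(num_classes)
        ]
        for n in range(num_classes)
    ]


# Toy example: class 0 has four samples, one of which is recognized as class 1.
matrix = confusion_percentages([0, 0, 0, 0, 1, 1], [0, 0, 1, 0, 1, 1], 2)
```

With this convention, a large off-diagonal entry `matrix[n][m]` means that class `m` is frequently mistaken for class `n`, which is the kind of confusion the caption refers to.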