Detect Everything with Few Examples

Xinyu Zhang, Yuhan Liu, Yuting Wang, Abdeslam Boularias

TL;DR

This paper introduces DE-ViT, a few-shot object detector that requires no finetuning. It is built on a new region-propagation mechanism for localization and sets new state-of-the-art results on the Pascal VOC, COCO, and LVIS benchmarks.

Abstract

Few-shot object detection aims at detecting novel categories given only a few example images. It is a basic skill for a robot to perform tasks in open environments. Recent methods focus on finetuning strategies, with complicated procedures that hinder wider application. In this paper, we introduce DE-ViT, a few-shot object detector without the need for finetuning. DE-ViT's novel architecture is based on a new region-propagation mechanism for localization. The propagated region masks are transformed into bounding boxes through a learnable spatial integral layer. Instead of training prototype classifiers, we propose to use prototypes to project ViT features into a subspace that is robust to overfitting on base classes. We evaluate DE-ViT on few-shot and one-shot object detection benchmarks with Pascal VOC, COCO, and LVIS. DE-ViT establishes new state-of-the-art results on all benchmarks. Notably, for COCO, DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot, and the one-shot SoTA by 2.8 AP50. For LVIS, DE-ViT outperforms the few-shot SoTA by 17 box APr. Further, we evaluate DE-ViT with a real robot by building a pick-and-place system for sorting novel objects based on example images. The videos of our robot demonstrations, the source code, and the models of DE-ViT can be found at https://mlzxy.github.io/devit.
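
The prototype projection mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `l2_normalize` and `project_onto_prototypes` are hypothetical, and the use of cosine similarity (dot product of L2-normalized vectors) is an assumption made for illustration. The idea shown is only that each feature vector is replaced by its similarities to a set of class prototypes, so downstream layers see similarity values rather than raw ViT features.

```python
import math


def l2_normalize(v):
    """Return v scaled to unit length (or a copy of v if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def project_onto_prototypes(features, prototypes):
    """Map each D-dim feature vector to a vector of similarities.

    features:   list of D-dim vectors, one per spatial location (hypothetical layout)
    prototypes: list of C D-dim prototype vectors, one per class
    returns:    list with one C-dim similarity vector per location
    """
    protos = [l2_normalize(p) for p in prototypes]
    sim_map = []
    for f in features:
        f = l2_normalize(f)
        # Assumption: cosine similarity between the feature and each prototype.
        sim_map.append([sum(a * b for a, b in zip(f, p)) for p in protos])
    return sim_map


# Toy example: three 2-dim "locations" and two class prototypes.
features = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
sim_map = project_onto_prototypes(features, prototypes)
```

In this toy example, the output has one channel per prototype instead of one per feature dimension, which is what allows the later stages to operate on similarity maps alone.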

Paper Structure

This paper contains 21 sections, 9 equations, 16 figures, 11 tables.

Figures (16)

  • Figure 1: Demonstration of the proposed method on YCB objects (Calli et al., 2015). DE-ViT with ViT-L/14 is used for prediction. Note that our model is trained on only the base categories of LVIS. Example images of YCB objects are provided only during inference to represent novel categories.
  • Figure 2: Overview of the proposed method. Our approach uses DINOv2 ViT to encode the image into a feature map, from which proposal features are extracted using ROIAlign. Proposals are generated via an off-the-shelf RPN. Prototype projection transforms proposal features into similarity maps based on prototypes derived from ViT features of support images. Multi-class classification of proposals is recast as a series of one-vs-rest binary classification tasks without the need for costly per-class inference. Refined localization is accomplished by our novel region propagation module. Both classification and refined localization rely exclusively on the computed similarity maps.
  • Figure 3: Overview of our classification architecture. Class pre-selection chooses the top-$K$ classes based on the dot product similarity between the average feature of each proposal and class-level prototypes. The probability of each selected class $c_k$ is predicted through a binary classification network, shared by all the classes, in a one-vs-rest manner. The input to this classification network is the similarity map that results from the prototype projection, after rearranging it for each class.
  • Figure 4: Overview of our refined localization architecture. Proposal expansion enlarges each proposal by a fixed ratio to cover more object area. The spatial relationship between the original and expanded proposal is described via a heatmap. The segmentation network navigates the initial heatmap toward accurate object regions. The propagated heatmap is converted into bounding box coordinates through our spatial integral layer.
  • Figure 7: Performance of our method and few-shot SoTA at different shots.
  • ...and 11 more figures
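
Two of the steps described in the captions above, class pre-selection (Figure 3) and proposal expansion (Figure 4), can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's code: the function names `preselect_classes` and `expand_box`, the dictionary layout of the prototypes, and the choice to expand a box about its center are all hypothetical. The captions only state that the top-$K$ classes are chosen by dot-product similarity between the average proposal feature and class-level prototypes, and that each proposal is enlarged by a fixed ratio.

```python
def preselect_classes(proposal_features, class_prototypes, k):
    """Pick the k classes whose prototype best matches the proposal.

    proposal_features: list of D-dim vectors sampled inside one proposal
    class_prototypes:  dict mapping class name -> D-dim prototype (hypothetical layout)
    """
    n = len(proposal_features)
    dim = len(proposal_features[0])
    # Average feature of the proposal.
    avg = [sum(f[d] for f in proposal_features) / n for d in range(dim)]
    # Dot-product similarity between the average feature and each prototype.
    scores = {
        name: sum(a * b for a, b in zip(avg, proto))
        for name, proto in class_prototypes.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def expand_box(box, ratio):
    """Enlarge an (x1, y1, x2, y2) box by a fixed ratio.

    Assumption: the expansion keeps the box center fixed.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * ratio, (y2 - y1) * ratio
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Toy example with made-up class names and 2-dim features.
prototypes = {"cup": [1.0, 0.0], "ball": [0.0, 1.0], "box": [0.5, 0.5]}
top2 = preselect_classes([[1.0, 0.0], [1.0, 0.0]], prototypes, k=2)
expanded = expand_box((10.0, 10.0, 20.0, 20.0), 1.5)
```

Pre-selecting a small set of candidate classes means the shared binary classifier only has to be evaluated for those candidates, in line with the captions' stated aim of avoiding costly per-class inference.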