Detect Everything with Few Examples

Xinyu Zhang; Yuhan Liu; Yuting Wang; Abdeslam Boularias

TL;DR

This paper introduces DE-ViT, a few-shot object detector that needs no finetuning. It localizes objects with a new region-propagation mechanism and sets new state-of-the-art results on the Pascal VOC, COCO, and LVIS few-shot and one-shot benchmarks.

Abstract

Few-shot object detection aims to detect novel categories given only a few example images. It is a basic skill for a robot to perform tasks in open environments. Recent methods focus on finetuning strategies, with complicated procedures that hinder wider adoption. In this paper, we introduce DE-ViT, a few-shot object detector without the need for finetuning. DE-ViT's novel architecture is based on a new region-propagation mechanism for localization. The propagated region masks are transformed into bounding boxes through a learnable spatial integral layer. Instead of training prototype classifiers, we propose to use prototypes to project ViT features into a subspace that is robust to overfitting on base classes. We evaluate DE-ViT on few-shot and one-shot object detection benchmarks with Pascal VOC, COCO, and LVIS. DE-ViT establishes new state-of-the-art results on all benchmarks. Notably, for COCO, DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot, and the one-shot SoTA by 2.8 AP50. For LVIS, DE-ViT outperforms the few-shot SoTA by 17 box APr. Further, we evaluate DE-ViT with a real robot by building a pick-and-place system for sorting novel objects based on example images. The videos of our robot demonstrations, the source code, and the models of DE-ViT can be found at https://mlzxy.github.io/devit.
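
The prototype projection mentioned in the abstract can be illustrated with a small sketch. It assumes that "projection" means scoring every spatial feature vector against every class prototype; cosine similarity is used here as the score, which is an assumption, since this page does not give the exact formula. The function names `cosine` and `prototype_projection` and the toy data are hypothetical and are not taken from the DE-ViT source code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def prototype_projection(feature_map, prototypes):
    """Turn an H x W grid of D-dim features into one H x W similarity map per class.

    feature_map: nested lists, feature_map[y][x] is a D-dim feature vector.
    prototypes:  dict mapping class name -> D-dim prototype vector.
    Assumption: cosine similarity stands in for the paper's projection.
    """
    return {
        name: [[cosine(cell, proto) for cell in row] for row in feature_map]
        for name, proto in prototypes.items()
    }


# Toy 2 x 2 feature map with 2-dim features and two illustrative classes.
feature_map = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [2.0, 0.0]],
]
prototypes = {"mug": [1.0, 0.0], "banana": [0.0, 1.0]}
sim = prototype_projection(feature_map, prototypes)
```

In this sketch the downstream networks would only ever see similarity values, one channel per class, rather than the raw feature dimensions. That is one plausible reading of why the abstract describes the projected subspace as robust to overfitting on base classes.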

Paper Structure (21 sections, 9 equations, 16 figures, 11 tables)

Introduction
Related work
Method
Classification with an Unknown Number of Classes
Localization with Region Propagation
Building Prototypes
Experiments
Main Results
Analysis
Ablation Study
Conclusion
Appendix
Additional Experiments
Over-expanded Proposal Analysis
Detection Accuracy at More Shots
...and 6 more sections

Figures (16)

Figure 1: Demonstration of the proposed method on YCB objects (Calli et al., 2015). DE-ViT with ViT-L/14 is used for prediction. Note that our model is trained on only the base categories of LVIS. Example images of YCB objects are provided only during inference to represent novel categories.
Figure 2: Overview of the proposed method. Our approach uses DINOv2 ViT to encode the image into a feature map, from which proposal features are extracted using ROIAlign. Proposals are generated via an off-the-shelf RPN. Prototype projection transforms proposal features into similarity maps based on prototypes derived from ViT features of support images. Multi-class classification of proposals is recast as a series of one-vs-rest binary classification tasks without the need for costly per-class inference. Refined localization is accomplished by our novel region propagation module. Both classification and refined localization rely exclusively on the computed similarity maps.
Figure 3: Overview of our classification architecture. Class pre-selection chooses the top-$K$ classes based on the dot product similarity between the average feature of each proposal and class-level prototypes. The probability of each selected class $c_k$ is predicted through a binary classification network, shared by all the classes, in a one-vs-rest manner. The input to this classification network is the similarity map that results from the prototype projection, after rearranging it for each class.
Figure 4: Overview of our refined localization architecture. Proposal expansion enlarges each proposal by a fixed ratio to cover more object area. The spatial relationship between the original and expanded proposal is described via a heatmap. The segmentation network navigates the initial heatmap toward accurate object regions. The propagated heatmap is converted into bounding box coordinates through our spatial integral layer.
Figure 7: Performance of our method and few-shot SoTA at different shots.
...and 11 more figures
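
Two of the simpler steps named in the captions above, class pre-selection (Figure 3) and proposal expansion (Figure 4), can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the expansion ratio is assumed to scale width and height about the box center, and clipping to the image bounds is an added assumption that the captions do not mention.

```python
def preselect_classes(proposal_features, class_prototypes, k):
    """Return the top-k class names for one proposal.

    Follows the Figure 3 caption: average the proposal's features, then rank
    classes by dot product between that average and each class prototype.
    proposal_features: list of D-dim vectors sampled inside the proposal.
    class_prototypes:  dict mapping class name -> D-dim prototype vector.
    """
    n = len(proposal_features)
    dim = len(proposal_features[0])
    avg = [sum(f[d] for f in proposal_features) / n for d in range(dim)]
    scores = {
        name: sum(a * p for a, p in zip(avg, proto))
        for name, proto in class_prototypes.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def expand_box(box, ratio, image_w, image_h):
    """Enlarge an (x1, y1, x2, y2) box by a fixed ratio about its center.

    Assumption: the result is clipped to the image, which is not stated
    in the captions.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) * ratio / 2
    half_h = (y2 - y1) * ratio / 2
    return (
        max(0.0, cx - half_w),
        max(0.0, cy - half_h),
        min(float(image_w), cx + half_w),
        min(float(image_h), cy + half_h),
    )


# Toy usage with illustrative class names and 2-dim features.
features = [[1.0, 0.0], [3.0, 0.0]]
protos = {"mug": [1.0, 0.0], "banana": [0.0, 1.0], "bowl": [0.5, 0.5]}
top2 = preselect_classes(features, protos, k=2)
expanded = expand_box((10.0, 10.0, 30.0, 20.0), 1.5, 100, 100)
```

Pre-selecting K classes is what lets a single shared binary classifier run only K times per proposal instead of once per class. In the paper, the expanded box is the starting point that region propagation then refines.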
