Revisiting Few-Shot Object Detection with Vision-Language Models

Anish Madan, Neehar Peri, Shu Kong, Deva Ramanan

TL;DR

This work revisits few-shot object detection (FSOD) in the era of foundation vision-language models (VLMs) and shows that zero-shot VLMs can outperform traditional FSOD baselines on COCO (e.g., GroundingDINO achieves $AP=48.3$ vs. $AP=33.1$ for state-of-the-art few-shot detectors). It introduces Foundational FSOD, a benchmark protocol that allows pre-training on any external data, including web-scale datasets, and aligns detectors to target concepts with multi-modal $K$-shot examples, repurposing nuImages for evaluation. The paper evaluates several alignment strategies (prompting, prompt tuning, federated fine-tuning, and multi-modal prompting) and analyzes their impact on few-shot detection performance, including iterative prompting with GPT-4o and multi-modal chat assistants. A CVPR 2024 competition further demonstrates the potential of this setup: the winning team outperforms the baseline by 23.3 mAP, underscoring the practical value of language-driven concept alignment in data-constrained settings. Overall, the work argues that FSOD benchmarks should be updated for the foundation-model era and outlines concrete directions for using multi-modal cues to align VLMs with target detection concepts.

Abstract

The era of vision-language models (VLMs) trained on web-scale datasets challenges conventional formulations of "open-world" perception. In this work, we revisit the task of few-shot object detection (FSOD) in the context of recent foundational VLMs. First, we point out that zero-shot predictions from VLMs such as GroundingDINO significantly outperform state-of-the-art few-shot detectors (48 vs. 33 AP) on COCO. Despite their strong zero-shot performance, such foundation models may still be sub-optimal. For example, trucks on the web may be defined differently from trucks for a target application such as autonomous vehicle perception. We argue that the task of few-shot recognition can be reformulated as aligning foundation models to target concepts using a few examples. Interestingly, such examples can be multi-modal, using both text and visual cues, mimicking instructions that are often given to human annotators when defining a target concept of interest. Concretely, we propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external data and fine-tuned on multi-modal (text and visual) K-shot examples per target class. We repurpose nuImages for Foundational FSOD, benchmark several popular open-source VLMs, and provide an empirical analysis of state-of-the-art methods. Lastly, we discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 23.3 mAP! Our code and dataset splits are available at https://github.com/anishmadan23/foundational_fsod

Paper Structure (19 sections, 6 figures, 12 tables, 1 algorithm)

This paper contains 19 sections, 6 figures, 12 tables, 1 algorithm.

Figures (6)

  • Figure 1: Poor Alignment Between Vision Language Models (VLMs) and Target Concepts. Although VLMs show impressive zero-shot performance, they struggle when the target class is different from concepts encountered in web-scale training. On the left, we see that the nuImages dataset (Caesar et al., 2020) defines the cab of the truck as a separate concept from its trailer (shown in red). In contrast, the VLM predicts the entire vehicle as a truck (shown in green). Similarly, nuImages annotations dictate that a person riding a bicycle must also be labeled as part of bicycle (shown in red), unlike the VLM prediction (in green). On the right, we present the actual class definitions given to the nuImages annotators (https://github.com/nutonomy/nuscenes-devkit/blob/master/docs/instructions_nuimages.md), provided as both textual descriptions and visual examples. Just as human annotators learn concepts from few-shot multi-modal examples, we argue that VLMs should be aligned with $K$ vision-language examples.
  • Figure 2: Foundational Few-Shot Object Detection (FSOD). Conventional FSOD protocols (left) allow for pre-training on base classes (with many examples per class) and then fine-tuning on $K$-shots of novel classes, where novel and base are designed to be disjoint. However, we point out that pre-training datasets such as ImageNet often contain classes similar to novel classes, highlighting the issue of concept leakage. As preventing concept leakage is difficult (if not impossible) and appears to be artificial in the foundational era, we propose Foundational FSOD (right). Our setup allows for pre-training on massive (and potentially proprietary) datasets, typical for foundational vision-language models. Since these models can process both text and images, one can utilize such multi-modal $K$-shot examples to align VLMs with the target concepts of interest.
  • Figure 3: Iteratively Prompting ChatGPT. Despite their large-scale pre-training, multi-modal models like ChatGPT-4o also suffer from concept misalignment. Specifically, GPT-4o makes highly confident but incorrect predictions for debris. We propose an iterative prompting strategy to better align the model to a target concept. Given a few visual examples per class from the training set, we query ChatGPT to use its "web-scale knowledge" to generate text descriptions. We then augment the input to MQDet to incorporate this additional context for zero-shot evaluation.
  • Figure 4: Visualizing Random and Best Split. In the top row, we visualize the 5-shot training examples of strollers from a random split. Similarly, we visualize the 5-shot training examples from the best split in the bottom row. We observe that strollers in the random split are often occluded, small in size and blurry, making few-shot learning harder. On the other hand, the best split examples are larger, have better visual quality and are relatively un-occluded. This visual difference directly translates into better few-shot performance. We achieve $\mathbf{13.09}$ Stroller AP for the random split and $\mathbf{18.54}$ Stroller AP for the best split. We show a more comprehensive evaluation in Table \ref{tab:best_split}.
  • Figure 5: We visualize the distribution of classes in our test set compared to the cardinalities of classes in the full nuImages val-set. Notably, our sub-sampling strategy of selecting validation images that have at least one annotation from medium or few classes does not significantly alter the true distribution.
  • ...and 1 more figure
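
The sub-sampling rule described in the Figure 5 caption (keep validation images that have at least one annotation from a medium or few class) can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes COCO-style annotation dicts with `image_id` and `category_id` keys, and the function name `subsample_images`, the parameter `rare_category_ids`, and the toy ids are all hypothetical.

```python
from typing import Iterable, Mapping, Set


def subsample_images(annotations: Iterable[Mapping[str, int]],
                     rare_category_ids: Set[int]) -> Set[int]:
    """Return ids of images with at least one annotation from a rare class.

    `rare_category_ids` stands in for the union of the "medium" and "few"
    class groups; how classes are assigned to those groups is not shown here.
    """
    return {
        ann["image_id"]
        for ann in annotations
        if ann["category_id"] in rare_category_ids
    }


# Toy example with hypothetical ids: category 7 is treated as a rare class.
toy_annotations = [
    {"image_id": 1, "category_id": 0},
    {"image_id": 1, "category_id": 7},
    {"image_id": 2, "category_id": 0},
    {"image_id": 3, "category_id": 7},
]
kept = subsample_images(toy_annotations, rare_category_ids={7})
print(sorted(kept))
```

Image 2 is dropped because it contains only an annotation from a class outside the rare set; images 1 and 3 are kept along with all of their annotations, which is why frequent classes still appear in the resulting subset.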