Revisiting Few-Shot Object Detection with Vision-Language Models
Anish Madan, Neehar Peri, Shu Kong, Deva Ramanan
TL;DR
This work revisits few-shot object detection (FSOD) in the era of foundational vision-language models (VLMs) and shows that zero-shot VLMs can outperform state-of-the-art few-shot detectors on COCO (e.g., GroundingDINO reaches $AP=48.3$ vs. $AP=33.1$). It introduces Foundational FSOD, a benchmark protocol that allows pre-training on any external data and aligns detectors to target concepts with multi-modal (text and visual) $K$-shot examples per class, repurposing nuImages for evaluation. The paper surveys and evaluates several alignment strategies (prompting, prompt tuning, federated fine-tuning, and multi-modal prompting) and analyzes their impact on few-shot detection performance, including iterative prompting with GPT-4o and multi-modal chat assistants. A CVPR 2024 competition shows further headroom: the winning team outperforms the baseline by 23.3 mAP, underscoring the practical value of aligning VLMs to target concepts in data-constrained settings. Overall, the work argues that FSOD benchmarks should be updated to account for foundation models and outlines concrete directions for using multi-modal cues to align VLMs with target detection concepts.
Abstract
The era of vision-language models (VLMs) trained on web-scale datasets challenges conventional formulations of "open-world" perception. In this work, we revisit the task of few-shot object detection (FSOD) in the context of recent foundational VLMs. First, we point out that zero-shot predictions from VLMs such as GroundingDINO significantly outperform state-of-the-art few-shot detectors (48 vs. 33 AP) on COCO. Despite their strong zero-shot performance, such foundation models may still be sub-optimal. For example, trucks on the web may be defined differently from trucks for a target application such as autonomous vehicle perception. We argue that the task of few-shot recognition can be reformulated as aligning foundation models to target concepts using a few examples. Interestingly, such examples can be multi-modal, using both text and visual cues, mimicking instructions that are often given to human annotators when defining a target concept of interest. Concretely, we propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external data and fine-tuned on multi-modal (text and visual) K-shot examples per target class. We repurpose nuImages for Foundational FSOD, benchmark several popular open-source VLMs, and provide an empirical analysis of state-of-the-art methods. Lastly, we discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 23.3 mAP! Our code and dataset splits are available at https://github.com/anishmadan23/foundational_fsod
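To make the "K-shot examples per target class" setup concrete, the following is a minimal sketch of how a K-shot support set might be drawn from a list of box annotations. It is an illustration under stated assumptions, not the split procedure used for the released benchmark: the function name `sample_k_shot` and the annotation keys `image_id`, `category`, and `bbox` are hypothetical, and the sketch samples instances independently per class without handling images that contain several classes.

```python
import random
from collections import defaultdict


def sample_k_shot(annotations, k, seed=0):
    """Pick up to k annotated instances per class (illustrative only).

    annotations: list of dicts with hypothetical keys
                 'image_id', 'category', and 'bbox'.
    Returns a dict mapping each category to its sampled annotations.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Group annotations by class label.
    by_class = defaultdict(list)
    for ann in annotations:
        by_class[ann["category"]].append(ann)

    # Sample without replacement; classes with fewer than k instances
    # contribute everything they have.
    shots = {}
    for category in sorted(by_class):
        pool = by_class[category]
        shots[category] = rng.sample(pool, min(k, len(pool)))
    return shots


# Toy annotations: 7 trucks and 2 pedestrians (made-up data).
toy = [{"image_id": i, "category": "truck", "bbox": [0, 0, 10, 10]} for i in range(7)]
toy += [{"image_id": 100 + i, "category": "pedestrian", "bbox": [5, 5, 8, 20]} for i in range(2)]

support = sample_k_shot(toy, k=5)
```

In a multi-modal setting, each class in such a support set would additionally be paired with a text description of the target concept; that pairing is omitted here for brevity.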
