Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models

Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri

TL;DR

RF100-VL introduces a large-scale, multi-domain detection benchmark to probe vision-language models on concepts outside typical internet-scale pre-training. By providing 100 diverse datasets from Roboflow Universe and rich multi-modal annotator instructions, the work enables zero-shot, few-shot, semi-supervised, and fully supervised evaluations. Empirical results show open-vocabulary detectors and specialist detectors outperform generalist MLLMs in many settings, while multi-modal instructions offer limited consistent gains, underscoring the need for better concept alignment strategies. The benchmark and accompanying findings aim to spur development of robust, cross-domain VLMs capable of few-shot concept alignment and open-set detection across varied imaging modalities.

Abstract

Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 17 mAP! Our code and dataset are available at https://github.com/roboflow/rf100-vl and https://universe.roboflow.com/rf100-vl/.

Paper Structure

This paper contains 30 sections, 8 figures, 13 tables.

Figures (8)

  • Figure 1: Roboflow100-VL Dataset. We identify a set of $100$ challenging datasets from Roboflow Universe that contain concepts not typically found in internet-scale pre-training. To simplify analysis, we cluster these $100$ datasets using per-dataset CLIP [radford2021learning] embeddings into seven categories. We visualize examples from each of these categories above. Furthermore, we also generate multi-modal instructions for each dataset with a few visual examples and rich textual descriptions per class to facilitate few-shot concept alignment.
  • Figure 2: Hard Examples in Roboflow100-VL. Our dataset is particularly challenging because it is difficult to detect objects in RF100-VL using class names alone. Specifically, we select datasets where classes are labeled using scientific names, acronyms, context-dependent names, or material properties. We posit that models must leverage multi-modal contextual annotations to address such hard examples.
  • Figure 3: Multi-Modal Few-Shot Examples. We present an example of the few-shot visual examples and rich text descriptions used for in-context prompting and fine-tuning. Notably, image examples used for each class may overlap and are only guaranteed to have exhaustive annotations for one class. Such multi-modal examples help clarify ambiguous concepts like soft plastic and metal.
  • Figure 4: Dataset Curation. We begin by sorting all object detection datasets on Roboflow Universe by stars as a proxy for quality and usefulness to the community. Next, we manually filter out all datasets with common classes, datasets where images only have a single focal object, and datasets with watermarks. We generate 10-shot splits following the protocol defined by Wang et al. [wang2020frustratingly], where we find a subset of images with 10 total instances per class. We use these 10-shot splits to generate visually grounded "annotator instructions", and manually update these instructions to add any salient details missed by GPT-4o. Finally, human labelers verify that all images within a dataset follow consistent annotation policies (e.g., bounding-box fit, semantic legibility of class names, and completeness of annotation instructions).
  • Figure 5: Dataset Statistics. The table on the left provides details on the number of classes, images, and annotations across different dataset types within RF100-VL. The figure on the right illustrates the distribution of dataset types by count. Notably, despite containing 100 datasets, RF100-VL is half the size of COCO [lin2014microsoft] (by number of images) and can feasibly be benchmarked on academic-level compute.
  • ...and 3 more figures
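
The 10-shot split described in the Figure 4 caption (a subset of images with 10 total instances per class) can be illustrated with a small greedy sampler. This is only a sketch under stated assumptions, not the authors' released code or the exact procedure of Wang et al.: the function name `sample_k_shot`, the `image_to_labels` input format, and the greedy skip-if-overshoot rule are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def sample_k_shot(image_to_labels, k=10, seed=0):
    """Greedily pick images so each class reaches k instances where possible.

    image_to_labels: dict mapping image id -> list of class names, one per box
    (a hypothetical input format chosen for this sketch).
    Returns (selected image ids, per-class instance counts in the selection).
    """
    rng = random.Random(seed)
    image_ids = sorted(image_to_labels)
    rng.shuffle(image_ids)  # random visiting order, reproducible via seed

    classes = sorted({c for labels in image_to_labels.values() for c in labels})
    counts = Counter()   # instances collected so far, per class
    selected = []        # chosen image ids, in selection order
    chosen = set()       # same ids, for fast membership checks

    for cls in classes:
        for img in image_ids:
            if counts[cls] >= k:
                break  # this class already has k instances
            if img in chosen:
                continue
            labels = Counter(image_to_labels[img])
            if labels[cls] == 0:
                continue  # image does not contain the class we are filling
            # Assumption: skip images that would push any class past k.
            if any(counts[c] + n > k for c, n in labels.items()):
                continue
            selected.append(img)
            chosen.add(img)
            counts.update(labels)

    return selected, dict(counts)
```

Because the sampler is greedy, a class can end up with fewer than `k` instances when every remaining image containing it would overshoot another class; a real pipeline would need to decide how to handle that case.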