Scaling Open-Vocabulary Object Detection

Matthias Minderer; Alexey Gritsenko; Neil Houlsby

TL;DR

This paper addresses the data bottleneck in open-vocabulary object detection by scaling self-training on Web image-text pairs. It introduces OWL-ST, a simple and scalable self-training recipe, and OWLv2, an efficiency-optimized architecture, which together enable training on billions of pseudo-annotations. The approach yields strong performance on LVIS rare classes and broad open-world generalization, with large gains when scaling to Web-scale data (with an L/14 architecture, AP on LVIS rare classes rises from 31.2% to 44.6%). It also examines label-space design, pseudo-annotation filtering, and fine-tuning trade-offs, highlighting both the potential and the practical limits of self-training for open-vocabulary localization.

Abstract

Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding further large improvements: with an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.
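The pseudo-annotation step described in the abstract can be illustrated with a minimal sketch. It assumes the detector is queried with word N-grams taken from the caption (as the Figure 1 caption states) and that low-confidence boxes are dropped by a fixed threshold. The function names, the detection dictionary format, the N-gram length limit, and the per-box filtering rule are all hypothetical choices for illustration, not the authors' implementation.

```python
from typing import Callable, Dict, List

# Hypothetical detection record: {"query": str, "box": tuple, "score": float}.
Detection = Dict[str, object]


def caption_ngrams(caption: str, max_n: int = 3) -> List[str]:
    """Return the unique word N-grams (n = 1..max_n) of a caption, in order."""
    words = caption.lower().split()
    seen = set()
    ngrams: List[str] = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            if gram not in seen:
                seen.add(gram)
                ngrams.append(gram)
    return ngrams


def pseudo_annotate(
    image: object,
    caption: str,
    detector: Callable[[object, List[str]], List[Detection]],
    threshold: float = 0.3,
) -> List[Detection]:
    """Query the detector with caption N-grams and keep confident boxes.

    Assumption: filtering is applied per box; an image whose list comes
    back empty would simply be skipped by the caller.
    """
    queries = caption_ngrams(caption)
    detections = detector(image, queries)
    return [d for d in detections if d["score"] >= threshold]


def toy_detector(image: object, queries: List[str]) -> List[Detection]:
    """Hypothetical stand-in for a real open-vocabulary detector."""
    scores = {"dog": 0.9, "red ball": 0.4, "a": 0.05}
    return [
        {"query": q, "box": (0.0, 0.0, 1.0, 1.0), "score": scores[q]}
        for q in queries
        if q in scores
    ]


kept = pseudo_annotate(None, "A dog chasing a red ball", toy_detector)
print([d["query"] for d in kept])
```

In this toy run the query "a" is discarded because its score falls below the threshold, while "dog" and "red ball" survive. Raising the threshold trades annotation quantity for precision, which is the trade-off the filtering experiments explore.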

Paper Structure (41 sections, 11 figures, 5 tables)

Introduction
Related Work
Scaling Vision Models
Open-Vocabulary Object Detection
Scaling Open-Vocabulary Detection with Weak Supervision
Method
Generating Web-Scale Open-Vocabulary Object Annotations
Human-curated label space.
Machine-generated label space.
Self-training at Scale
Token dropping.
Instance selection.
Mosaics.
Fine-tuning
Experiments
...and 26 more sections

Figures (11)

Figure 1: Overview of our method. Left: Our method consists of three steps: (1) Generate pseudo-box annotations on WebLI with OWL-ViT L/14, queried with caption N-grams. (2) Train new models on pseudo-annotations. (3) Optionally, fine-tune on human annotations. Right: Zero-shot detection performance on $\text{LVIS}_\text{rare}$ after fine-tuning on $\text{LVIS}_\text{base}$. Neither the annotator nor our models have seen any human-generated box annotations for $\text{LVIS}_\text{rare}$ classes. Our self-training approach improves over other methods even at moderate amounts of training (e.g. the OWL-L/14 model we use as annotator; black $\times$), and continues to improve as training is scaled up. Horizontal black lines indicate previous state-of-the-art open-vocabulary detectors which did not see $\text{LVIS}_\text{rare}$ classes during training.
Figure 2: Comparison of pseudo-label spaces. Self-training on a human-curated list of classes yields good downstream performance on these classes, but generalizes poorly to unseen classes and datasets. Open-vocabulary generalization can be improved by obtaining weak but diverse supervision from image-associated text. WebLI image-text data was pseudo-annotated using OWL-ViT CLIP-L/14 with one of three label spaces: Curated vocabulary (the union of label spaces from LVIS, Objects365, OpenImagesv4, and Visual Genome), N-grams (lightly filtered N-grams from the text associated with each image), or a combination of both (N-grams + curated). OWLv2-B/16 models were then self-trained on the pseudo-annotations and fine-tuned on $\text{LVIS}_\text{base}$. Each point represents a separate fine-tuning run. "Examples seen" refers to the number of images after creating mosaics; the total number of raw images seen is $13.2 \times$ that number (see the section "Self-training at Scale").
Figure 3: Impact of pseudo-annotation filtering by detection confidence on self-training effectiveness. Pseudo-labels (N-gram label space) were filtered using different confidence thresholds. Number of remaining images for each threshold: 0.1: 5B, 0.3: 2B, 0.5: 782M, 0.7: 224M. OWLv2-B/16 detectors were self-trained on the filtered pseudo-annotations and fine-tuned on $\text{LVIS}_\text{base}$. Each point represents a different fine-tuning run. "Examples seen" refers to the number of images after creating mosaics; the total number of raw images seen is $13.2 \times$ that number (see the section "Self-training at Scale").
Figure 4: Scaling of detection performance with model size and training compute. Models show classic scaling behavior [scalingvit]: Performance increases monotonically with training compute, with larger models being necessary to benefit from larger amounts of compute/data. Models were self-trained on N-gram pseudo-annotations and fine-tuned on $\text{LVIS}_\text{base}$.
Figure 5: Trade-off between fine-tuned and open-world performance. Self-training yields continued improvements on a suite of diverse datasets (ODinW13; $x$-axis), but performance on any given dataset (e.g. LVIS; $y$-axis) may saturate (red circles). Fine-tuning on a target dataset improves performance on that dataset, but reduces the open-world generalization ability in proportion to the fine-tuning duration (light blue squares; numbers indicate fine-tuning steps). This trade-off can be improved through weight-space ensembling (averaging) of the pretrained and fine-tuned checkpoints [wortsman2022robust] (purple diamonds; numbers indicate the mixing coefficient for the fine-tuned weights). The plot shows B/16 models self-trained on N-gram pseudo-annotations and evaluated either directly after self-training or after fine-tuning on $\text{LVIS}_\text{base}$. Ensembles were created between the longest-self-trained checkpoint and the weights obtained after fine-tuning that checkpoint for 20k steps. Note that there is significant variability in ODinW13 performance between checkpoints towards the end of self-training.
...and 6 more figures
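
The weight-space ensembling mentioned in the Figure 5 caption amounts to a per-parameter linear interpolation between two checkpoints. The sketch below is a minimal illustration, assuming checkpoints are stored as dictionaries of flat float lists; the function name, the `Weights` type, and the parameter name `alpha` are hypothetical. Here `alpha` plays the role of the mixing coefficient for the fine-tuned weights.

```python
from typing import Dict, List

# Hypothetical checkpoint format: parameter name -> flat list of floats.
Weights = Dict[str, List[float]]


def interpolate_weights(pretrained: Weights, finetuned: Weights, alpha: float) -> Weights:
    """Return (1 - alpha) * pretrained + alpha * finetuned, per parameter.

    alpha = 0 reproduces the self-trained checkpoint and alpha = 1 the
    fine-tuned one; values in between mix the two.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("checkpoints must contain the same parameters")

    mixed: Weights = {}
    for name, p_values in pretrained.items():
        f_values = finetuned[name]
        if len(p_values) != len(f_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        mixed[name] = [
            (1.0 - alpha) * p + alpha * f for p, f in zip(p_values, f_values)
        ]
    return mixed


# Toy checkpoints with a single two-element parameter.
self_trained = {"w": [0.0, 2.0]}
fine_tuned = {"w": [1.0, 4.0]}
print(interpolate_weights(self_trained, fine_tuned, alpha=0.25))
```

Because the interpolation happens once in weight space, the resulting model costs the same at inference time as either original checkpoint, unlike an ensemble that averages the predictions of two models.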
