DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection

Lewei Yao; Renjie Pi; Jianhua Han; Xiaodan Liang; Hang Xu; Wei Zhang; Zhenguo Li; Dan Xu

TL;DR

DetCLIPv3 tackles the reliance on predefined category vocabularies in open-vocabulary object detection by introducing a versatile framework that jointly performs detection and generation of hierarchical object labels. It combines a robust OV detector with a dedicated object captioner and leverages a large-scale auto-annotation pipeline (GranuCap50M) built from VLLM-based recaptioning, GPT-4 entity extraction, and LLaVA instruction tuning, followed by a three-stage training scheme that scales learning from low- to high-resolution data. Empirically, it achieves state-of-the-art zero-shot fixed AP on LVIS (47.0 with Swin-T) and dense captioning AP on VG (19.7), while exhibiting strong domain generalization (COCO-O) and transfer performance after fine-tuning. The work demonstrates the feasibility and value of generating multi-granular object descriptions to enrich open-vocabulary perception, potentially broadening practical deployment in diverse domains. Overall, DetCLIPv3 advances open-vocabulary detection by enabling descriptive, hierarchical labeling alongside robust detection under limited category priors.

Abstract

Existing open-vocabulary object detectors typically require a predefined set of categories from users, significantly confining their application scenarios. In this paper, we introduce DetCLIPv3, a high-performing detector that excels not only at open-vocabulary object detection but also at generating hierarchical labels for detected objects. DetCLIPv3 is characterized by three core designs: 1. Versatile model architecture: we derive a robust open-set detection framework which is further empowered with generation ability via the integration of a caption head. 2. High information density data: we develop an auto-annotation pipeline leveraging a visual large language model to refine captions for large-scale image-text pairs, providing rich, multi-granular object labels to enhance the training. 3. Efficient training strategy: we employ a pre-training stage with low-resolution inputs that enables the object captioner to efficiently learn a broad spectrum of visual concepts from extensive image-text paired data. This is followed by a fine-tuning stage that leverages a small number of high-resolution samples to further enhance detection performance. With these effective designs, DetCLIPv3 demonstrates superior open-vocabulary detection performance, e.g., our Swin-T backbone model achieves a notable 47.0 zero-shot fixed AP on the LVIS minival benchmark, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively. DetCLIPv3 also achieves a state-of-the-art 19.7 AP on the dense captioning task on the VG dataset, showcasing its strong generative capability.

Paper Structure (42 sections, 14 figures, 15 tables)

This paper contains 42 sections, 14 figures, 15 tables.

Introduction
Related works
Open-vocabulary object detection.
Dense captioning.
Re-captioning for image-text pairs.
Method
Model Design
Data formulation.
Open vocabulary detector.
Object captioner.
Dataset Construction
Auto-annotation data pipeline.
Noun concept corpus.
Multi-stage Training Scheme
Experiments
...and 27 more sections

Figures (14)

Figure 1: The versatility of DetCLIPv3 supports both open-vocabulary object detection (OVD) and the generation of hierarchical object labels. Top: when provided with extracted noun phrases from image-text pair captions as input, DetCLIPv3 can detect a broad spectrum of visual concepts. Bottom: In the absence of predefined categories as input, DetCLIPv3 detects potential objects and generates multi-granularity hierarchical labels for them, formatted as 'phrase $|$ category $|$ parent category'. DetCLIPv3 offers a more comprehensive interpretation of objects, significantly expanding the application scope of OVD systems. Zoom in for the best viewing.
Figure 2: (a): Existing open-vocabulary object detectors recognize objects based on category names; (b): Humans interpret visual concepts from multiple hierarchies and granularities.
Figure 3: The illustration for DetCLIPv3 framework. Left: the OV detector is responsible for localizing objects given category names, as well as providing object proposals for the object captioner. Right: The object captioner is designed to generate hierarchical labels for detected objects and also learns to generate image-level descriptions as an aid to its training.
Figure 3: Examples of bounding box pseudo-labels generated by DetCLIPv3's Swin-L model after stage-1 training.
Figure 4: Illustration of quality issues in image-text pair data. Row 1: Existing image-text pair datasets typically suffer from significant partial annotation and image-text misalignment problems. Row 2: Limited by their capabilities, traditional NLP parsers (NLTK, spaCy) extract nouns that do not correspond to actual objects in the images. Row 3: Our data pipeline provides refined captions with highly detailed image descriptions, preserving effective visual concepts from the original captions while supplementing missing concepts. Row 4: Our data pipeline provides rich, multi-granularity object entity information.
...and 9 more figures
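
The Figure 1 caption describes generated labels formatted as 'phrase | category | parent category'. As a rough illustration of how such a string could be split into its three levels, here is a minimal Python sketch. The `HierarchicalLabel` class, the `parse_label` function, and the example string are hypothetical and are not taken from the paper or its code; the sketch also assumes the separator is a plain `|` character.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HierarchicalLabel:
    """Hypothetical container for one generated label (names are assumptions)."""
    phrase: str           # fine-grained description of the object
    category: str         # object category
    parent_category: str  # coarser parent category


def parse_label(text: str) -> HierarchicalLabel:
    """Split 'phrase | category | parent category' into its three fields."""
    parts = [part.strip() for part in text.split("|")]
    # Require exactly three non-empty fields; anything else is malformed.
    if len(parts) != 3 or not all(parts):
        raise ValueError(
            f"expected 'phrase | category | parent category', got {text!r}"
        )
    return HierarchicalLabel(*parts)


# Made-up example string, used only to show the format.
label = parse_label("a brown puppy on the grass | dog | animal")
print(label.phrase, "->", label.category, "->", label.parent_category)
```

Keeping the three levels in separate fields would let downstream code match detections at whichever granularity it needs, for example grouping by `parent_category` while still displaying the full `phrase`.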
