DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection

Lewei Yao, Renjie Pi, Jianhua Han, Xiaodan Liang, Hang Xu, Wei Zhang, Zhenguo Li, Dan Xu

TL;DR

DetCLIPv3 tackles the reliance on predefined category vocabularies in open-vocabulary object detection by introducing a versatile framework that jointly performs detection and generation of hierarchical object labels. It combines a robust open-vocabulary (OV) detector with a dedicated object captioner and leverages a large-scale auto-annotation pipeline (GranuCap50M) built from VLLM-based recaptioning, GPT-4 entity extraction, and LLaVA instruction tuning, followed by a three-stage training scheme that scales learning from low- to high-resolution data. Empirically, it achieves state-of-the-art zero-shot fixed AP on LVIS (47.0 with Swin-T) and dense captioning AP on VG (19.7), while exhibiting strong domain generalization (COCO-O) and transfer performance after fine-tuning. The work demonstrates the feasibility and value of generating multi-granular object descriptions to enrich open-vocabulary perception, potentially broadening practical deployment in diverse domains. Overall, DetCLIPv3 advances open-vocabulary detection by enabling descriptive, hierarchical labeling alongside robust detection under limited category priors.

Abstract

Existing open-vocabulary object detectors typically require a predefined set of categories from users, significantly confining their application scenarios. In this paper, we introduce DetCLIPv3, a high-performing detector that excels not only at open-vocabulary object detection but also at generating hierarchical labels for detected objects. DetCLIPv3 is characterized by three core designs: 1. Versatile model architecture: we derive a robust open-set detection framework, which is further empowered with generation ability via the integration of a caption head. 2. High information density data: we develop an auto-annotation pipeline that leverages a visual large language model to refine captions for large-scale image-text pairs, providing rich, multi-granular object labels to enhance training. 3. Efficient training strategy: we employ a pre-training stage with low-resolution inputs that enables the object captioner to efficiently learn a broad spectrum of visual concepts from extensive image-text paired data. This is followed by a fine-tuning stage that leverages a small number of high-resolution samples to further enhance detection performance. With these designs, DetCLIPv3 demonstrates superior open-vocabulary detection performance; e.g., our Swin-T backbone model achieves 47.0 zero-shot fixed AP on the LVIS minival benchmark, outperforming GLIPv2, GroundingDINO, and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively. DetCLIPv3 also achieves a state-of-the-art 19.7 AP on the dense captioning task on the VG dataset, showcasing its strong generative capability.
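
The reported margins imply the baseline scores directly. As a small arithmetic check, the sketch below subtracts each margin from the 47.0 AP figure; the variable names are illustrative only, and the numbers are exactly those quoted in the abstract.

```python
# Figures quoted in the abstract (LVIS minival, zero-shot fixed AP, Swin-T backbone).
DETCLIPV3_AP = 47.0
margins = {"GLIPv2": 18.0, "GroundingDINO": 19.6, "DetCLIPv2": 6.6}

# Implied baseline AP = DetCLIPv3 AP minus the reported margin.
# round(..., 1) avoids floating-point noise in the last digit.
implied_baselines = {
    name: round(DETCLIPV3_AP - gap, 1) for name, gap in margins.items()
}

for name, ap in implied_baselines.items():
    print(f"{name}: {ap:.1f} AP")
```

Under these figures, the implied baselines are 29.0, 27.4, and 40.4 AP for GLIPv2, GroundingDINO, and DetCLIPv2, respectively.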

Paper Structure (42 sections, 14 figures, 15 tables)

Figures (14)

  • Figure 1: The versatility of DetCLIPv3 supports both open-vocabulary object detection (OVD) and the generation of hierarchical object labels. Top: when provided with extracted noun phrases from image-text pair captions as input, DetCLIPv3 can detect a broad spectrum of visual concepts. Bottom: In the absence of predefined categories as input, DetCLIPv3 detects potential objects and generates multi-granularity hierarchical labels for them, formatted as 'phrase | category | parent category'. DetCLIPv3 offers a more comprehensive interpretation of objects, significantly expanding the application scope of OVD systems. Zoom in for the best viewing.
  • Figure 2: (a): Existing open-vocabulary object detectors recognize objects based on category names; (b): Humans interpret visual concepts from multiple hierarchies and granularities.
  • Figure 3: Illustration of the DetCLIPv3 framework. Left: the OV detector is responsible for localizing objects given category names, as well as providing object proposals for the object captioner. Right: the object captioner is designed to generate hierarchical labels for detected objects and also learns to generate image-level descriptions as an aid to its training.
  • Figure 3: Examples of bounding box pseudo-labels generated by DetCLIPv3's Swin-L model after stage-1 training.
  • Figure 4: Illustration of quality issues in image-text pair data. Row 1: Existing image-text pair datasets typically suffer from significant partial annotation and image-text misalignment problems. Row 2: Limited in capability, traditional NLP parsers (NLTK, spaCy) extract nouns that do not correspond to actual objects in the images. Row 3: Our data pipeline provides refined captions with highly detailed image descriptions, preserving effective visual concepts from the original captions while supplementing missing concepts. Row 4: Our data pipeline provides rich, multi-granularity object entity information.
  • ...and 9 more figures
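
The hierarchical label format described in the Figure 1 caption ('phrase | category | parent category') can be split into its three fields with a few lines of code. This is a minimal sketch, not the authors' implementation: the `HierarchicalLabel` type, the `parse_label` function, and the example string are all hypothetical and only assume the pipe-separated layout stated in the caption.

```python
from typing import NamedTuple


class HierarchicalLabel(NamedTuple):
    """One generated label, ordered from most to least specific (hypothetical type)."""
    phrase: str
    category: str
    parent_category: str


def parse_label(text: str) -> HierarchicalLabel:
    """Split a 'phrase | category | parent category' string into its fields."""
    parts = [part.strip() for part in text.split("|")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 pipe-separated fields, got {len(parts)}: {text!r}")
    return HierarchicalLabel(*parts)


# Hypothetical example string; not taken from the paper.
label = parse_label("a brown dog lying on grass | dog | animal")
print(label.phrase)           # most specific description
print(label.category)         # object category
print(label.parent_category)  # coarser parent category
```

Keeping the three granularities as separate fields lets downstream code filter or group detections at whichever level of the hierarchy it needs.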