LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors

Sheng Jin, Xueying Jiang, Jiaxing Huang, Lewei Lu, Shijian Lu

TL;DR

This work tackles open-vocabulary object detection by leveraging fine-grained descriptor knowledge in vision-language models. It presents DVDet, which combines a conditional context prompt (CCP) that turns region features into image-like prompts for improved classification with open-vocabulary labels, and a hierarchical, iterative descriptor generation flow that uses large language models (LLMs) to mine and refine region-specific descriptors. The two flows are integrated into a standard two-stage detector with CLIP-based text embeddings, enabling effective region-text alignment without extra grounding data. Across COCO and LVIS benchmarks, DVDet delivers consistent, substantial gains over existing OVOD methods, demonstrating the value of descriptor-level alignment and LLM-assisted descriptor refinement for open-vocabulary dense prediction. The approach suggests a scalable path to fuse LLMs and VLMs for robust open-vocabulary detection in real-world applications.

Abstract

Inspired by the outstanding zero-shot capability of vision language models (VLMs) in image classification tasks, open-vocabulary object detection has attracted increasing interest by distilling the broad VLM knowledge into detector training. However, most existing open-vocabulary detectors learn by aligning region embeddings with categorical labels (e.g., bicycle) only, disregarding the capability of VLMs in aligning visual embeddings with fine-grained text descriptions of object parts (e.g., pedals and bells). This paper presents DVDet, a Descriptor-Enhanced Open Vocabulary Detector that introduces conditional context prompts and hierarchical textual descriptors that enable precise region-text alignment as well as open-vocabulary detection training in general. Specifically, the conditional context prompt transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training. In addition, we introduce large language models as an interactive and implicit knowledge repository, which enables iterative mining and refining of visually oriented textual descriptors for precise region-text alignment. Extensive experiments over multiple large-scale benchmarks show that DVDet consistently outperforms the state of the art by large margins.

Paper Structure (14 sections, 3 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Differences in image-text alignment between VLMs and OVOD. Over the whole COCO dataset, the visual and textual embeddings from VLMs are clearly better aligned than those from OVOD (using the state-of-the-art VLDet, Lin et al., 2023) for both categorical object labels and fine-grained descriptors, as shown in (a). The proposed DVDet mines and refines fine-grained descriptors with LLMs, which clearly improves region-text alignment compared with VLDet. Panel (b) shows this in more detail for an exemplar label 'bicycle' and fine-grained descriptors of bicycle parts. Alignment is measured by the cosine similarity between visual and textual embeddings.
  • Figure 2: Overview of our proposed DVDet framework. DVDet comprises two flows that improve region-text alignment in open vocabulary detection. In the prompt flow (denoted by the solid line), the proposed conditional context prompt (CCP) transforms the ROI embeddings into an image-like representation by fusing the contextual background information around the region proposal; this representation can then be incorporated into the training of open vocabulary detectors. In the descriptor flow (denoted by the dashed line), a hierarchy mechanism generates and updates fine-grained descriptors via iterative interaction with LLMs for precise region-text alignment.
  • Figure 3: The iterative update of fine-grained descriptors. In the training stage, we continuously generate new fine-grained descriptors (highlighted in blue boxes) via interaction with LLMs. With the recorded usage frequency, high-frequency descriptors (highlighted in green) are preserved and low-frequency descriptors (highlighted in red) are discarded. We can observe that certain fine-grained descriptors such as 'hair', 'two eyes', and 'face' are consistently preserved after generation, while visually irrelevant descriptors such as 'jewelry' are only generated at an early stage and then discarded.
  • Figure 4: Introducing our fine-grained text descriptors (shown at the bottom of each sample) consistently improves open-vocabulary detection, especially in challenging scenarios with distant or occluded objects, small inter-class variations, etc. For each of the four sample images, the red class names at the top-left corner of the first image are predictions without our method, and the green class names in the second image are predictions with our method. The red/green boxes within the sample images show the related detections. Zoom in for details.
  • Figure 5: Object detection is improved progressively with iterative extraction of fine-grained descriptors from LLMs and matching them with the detected target. Texts at the top-left corner of each sample show recognized classes, and texts at the bottom show extracted fine-grained descriptors.
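
The frequency-based descriptor update described in the Figure 3 caption, together with the cosine-similarity alignment measure from Figure 1, can be illustrated with a small sketch. This is not the authors' implementation. The captions do not specify how usage is counted or how many descriptors are kept, so the function names (`record_usage`, `prune_descriptors`), the `top_k` and `keep` parameters, and the toy three-dimensional embeddings are all assumptions made for illustration. The LLM interaction is omitted entirely, and descriptors are supplied as a plain dictionary.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def record_usage(region_embeddings, descriptor_embeddings, usage, top_k=2):
    """Count, for each region, the top_k descriptors most similar to it.

    Assumption: a descriptor counts as "used" when it ranks among the
    top_k matches of a region. The source only says usage frequency
    is recorded.
    """
    for region in region_embeddings:
        ranked = sorted(
            descriptor_embeddings,
            key=lambda name: cosine_similarity(region, descriptor_embeddings[name]),
            reverse=True,
        )
        usage.update(ranked[:top_k])
    return usage


def prune_descriptors(descriptor_embeddings, usage, keep=3):
    """Keep the `keep` most frequently used descriptors, discard the rest."""
    ranked = sorted(descriptor_embeddings, key=lambda name: usage[name], reverse=True)
    return {name: descriptor_embeddings[name] for name in ranked[:keep]}


# Toy (hypothetical) text embeddings for descriptors of the class "person".
descriptors = {
    "hair": [1.0, 0.1, 0.0],
    "two eyes": [0.1, 1.0, 0.0],
    "face": [0.7, 0.7, 0.0],
    "jewelry": [0.0, 0.0, 1.0],
}

# Toy (hypothetical) region embeddings produced by a detector.
regions = [
    [0.9, 0.2, 0.1],
    [0.2, 0.9, 0.1],
    [0.7, 0.5, 0.1],
]

usage = record_usage(regions, descriptors, Counter())
kept = prune_descriptors(descriptors, usage, keep=3)
print(sorted(kept))
```

In this toy setup 'jewelry' never ranks among the top matches of any region, so it is discarded, while 'hair', 'two eyes', and 'face' are preserved, mirroring the behaviour the caption describes. In an iterative setting, newly generated descriptors would be merged into `descriptors` before the next round of counting and pruning.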