LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models

Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, Wei-Shi Zheng

TL;DR

This work tackles open-vocabulary object detection by enabling a detector to leverage detailed language supervision from a large language model. It introduces GroundingCap-1M, a dataset of 1.12M samples containing image-level captions and region-grounding annotations, used to train LLMDet via joint grounding and caption-generation losses. LLMDet achieves state-of-the-art zero-shot performance on LVIS, ODinW, COCO-O, and referring expression datasets, demonstrating strong cross-domain generalization. Moreover, co-training an open-vocabulary detector with an LLM yields mutual benefits, enabling stronger large multimodal models when combined with other vision encoders. The work highlights the value of high-quality image-level captions for enriching vision-language representations and improving open-vocabulary detection.

Abstract

Recent open-vocabulary detectors achieve promising performance with abundant region-level annotated data. In this work, we show that co-training an open-vocabulary detector with a large language model, by generating a detailed image-level caption for each image, can further improve performance. To achieve this goal, we first collect a dataset, GroundingCap-1M, in which each image is accompanied by grounding labels and a detailed image-level caption. With this dataset, we finetune an open-vocabulary detector with training objectives including a standard grounding loss and a caption generation loss. We take advantage of a large language model to generate both region-level short captions for each region of interest and image-level long captions for the whole image. Under the supervision of the large language model, the resulting detector, LLMDet, outperforms the baseline by a clear margin, enjoying superior open-vocabulary ability. Further, we show that the improved LLMDet can in turn build a stronger large multi-modal model, achieving mutual benefits. The code, model, and dataset are available at https://github.com/iSEE-Laboratory/LLMDet.
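
The joint objective described in the abstract can be illustrated with a minimal sketch. It assumes the total loss is a weighted sum of a grounding loss and two caption (language modeling) losses, one for image-level and one for region-level captions. The function names, the unit weights, and the toy logits are hypothetical and are not taken from the LLMDet code base; the grounding loss is treated as an already computed scalar.

```python
import math


def lm_loss(token_logits, target_ids):
    """Mean next-token cross-entropy over one caption.

    token_logits: one list of vocabulary logits per caption position.
    target_ids:   the ground-truth token id for each position.
    """
    total = 0.0
    for logits, target in zip(token_logits, target_ids):
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[target]
    return total / len(target_ids)


def joint_loss(grounding_loss, image_logits, image_targets,
               region_logits, region_targets,
               w_image=1.0, w_region=1.0):
    """Grounding loss plus weighted caption losses (weights are assumptions)."""
    return (grounding_loss
            + w_image * lm_loss(image_logits, image_targets)
            + w_region * lm_loss(region_logits, region_targets))


# Toy usage: a 4-token vocabulary for the image caption, 2 tokens for the region caption.
total = joint_loss(
    grounding_loss=0.5,
    image_logits=[[0.0, 0.0, 0.0, 0.0]], image_targets=[1],
    region_logits=[[0.0, 0.0]], region_targets=[0],
)
print(total)
```

In a real training loop the caption losses would come from the LLM's output logits and the grounding loss from the detector head; the sketch only shows how the terms combine into one scalar to backpropagate.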

Paper Structure

This paper contains 19 sections, 1 equation, 7 figures, 12 tables.

Figures (7)

  • Figure 1: LLMDet achieves superior zero-shot performance across various benchmarks compared with other well-known counterparts. All detectors use Swin-T as the backbone.
  • Figure 2: An example of GroundingCap-1M. Bounding box annotations are omitted for clarity. Compared with the original short grounding texts, the detailed captions in GroundingCap-1M are rich in object types, textures, colors, parts of the objects, object actions, precise object locations, and texts. Captions in GroundingCap-1M average around 115 words.
  • Figure 3: The multi-step training pipeline of LLMDet. In each step, modules in orange are tunable while modules in blue are frozen. In the first step, we train a projector to align the detector's features with the LLM so that we can integrate the LLM into the detector without breaking the pretrained features. Then, in Step 2, we train the detector with a standard grounding task and the newly introduced captioning tasks.
  • Figure 4: The overview of LLMDet. LLMDet contains a standard open-vocabulary detector and a large language model (LLM) and is trained with both a grounding loss and a language modeling loss. The LLM is designed to generate both image-level captions, using feature maps as visual input, and region-level captions, using a single object query as visual input; the two tasks are distinguished by different prompts. Only vision tokens in region-level generation pass through the cross-attention (CA) modules in the LLM, which are highlighted by a dashed boundary. Since the number of tokens in image-level and region-level generation varies greatly, we forward the LLM twice separately to save memory and computation. The LLM can be discarded at inference time so that there is no extra cost.
  • Figure A-1: The multi-step training pipeline of using LLMDet to build a strong large multi-modal model. The large multi-modal model uses a mixture of vision encoders, including LLMDet and SigLIP. In each step, modules in orange are tunable while modules in blue are frozen. We first pretrain a new projector and then finetune the large multi-modal model with visual instruction tuning.
  • ...and 2 more figures
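
The Figure 4 caption notes that the LLM is run twice, once for image-level and once for region-level generation, because the two token counts differ greatly. One plausible reading is that batching both kinds of sequence together would pad the short region-level sequences up to the long image-level length. The sketch below only does that arithmetic; the padding explanation, the function names, and the sequence lengths are all illustrative assumptions, not figures from the paper.

```python
def padded_tokens(lengths):
    """Tokens processed when a batch is padded to its longest sequence."""
    return len(lengths) * max(lengths) if lengths else 0


def compare_batching(image_lengths, region_lengths):
    """Return (joint, separate) padded token counts for the two strategies."""
    joint = padded_tokens(image_lengths + region_lengths)
    separate = padded_tokens(image_lengths) + padded_tokens(region_lengths)
    return joint, separate


# Hypothetical lengths: image-level inputs carry many vision tokens,
# region-level inputs carry a single object query plus a short prompt.
image_lengths = [1000, 1100]
region_lengths = [20, 25, 30, 22]

joint, separate = compare_batching(image_lengths, region_lengths)
print(joint, separate)  # 6600 2320
```

Under these assumed lengths, the joint batch processes 6 × 1100 padded tokens, while two separate passes process 2 × 1100 + 4 × 30, which matches the caption's point that separate forwarding saves memory and computation.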