Generative Region-Language Pretraining for Open-Ended Object Detection

Chuang Lin; Yi Jiang; Lizhen Qu; Zehuan Yuan; Jianfei Cai

TL;DR

This paper formulates object detection as a generative problem and proposes GenerateU, a simple framework that detects dense objects and generates their names in free form. It also introduces an evaluation method that quantitatively measures the quality of the generated outputs.

Abstract

Recent research has devoted significant attention to open-vocabulary object detection, which aims to generalize beyond the limited number of classes labeled during training and to detect objects described by arbitrary category names at inference. Compared with conventional object detection, open-vocabulary object detection greatly extends the set of detectable categories. However, it relies on calculating the similarity between image regions and a set of arbitrary category names with a pretrained vision-and-language model. This implies that, despite its open-set nature, the task still needs predefined object categories during the inference stage. This raises the question: what if we do not have exact knowledge of object categories during inference? In this paper, we call this new setting generative open-ended object detection, which is a more general and practical problem. To address it, we formulate object detection as a generative problem and propose a simple framework named GenerateU, which can detect dense objects and generate their names in a free-form way. In particular, we employ Deformable DETR as a region proposal generator, with a language model translating visual regions to object names. To assess the free-form object detection task, we introduce an evaluation method designed to quantitatively measure the performance of generative outcomes. Extensive experiments demonstrate the strong zero-shot detection performance of GenerateU. For example, on the LVIS dataset, GenerateU achieves results comparable to the open-vocabulary object detection method GLIP, even though GenerateU does not see the category names during inference. Code is available at: https://github.com/FoundationVision/GenerateU.
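To make the limitation described in the abstract concrete, the following is a minimal sketch of similarity-based open-vocabulary classification. It is not the authors' code: the function names (`cosine`, `classify_regions`) and the three-dimensional toy embeddings are hypothetical, standing in for the outputs of a pretrained vision-and-language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_regions(region_embeddings, category_embeddings):
    """Give each region the predefined category name it is most similar to."""
    results = []
    for region in region_embeddings:
        scores = {
            name: cosine(region, text_vec)
            for name, text_vec in category_embeddings.items()
        }
        best = max(scores, key=scores.get)
        results.append((best, scores[best]))
    return results


# Hypothetical toy embeddings (assumption: real ones would come from a
# pretrained vision-and-language model and have far more dimensions).
category_embeddings = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
region_embeddings = [
    [0.9, 0.1, 0.0],  # region that resembles "cat"
    [0.1, 0.8, 0.1],  # region that resembles "dog"
    [0.0, 0.1, 0.9],  # region whose true category is not in the list
]

predictions = classify_regions(region_embeddings, category_embeddings)
for name, score in predictions:
    print(f"{name}: {score:.2f}")
```

In this toy setup the third region is still forced onto one of the two supplied names, with a low similarity score, because the classifier can only choose among categories given at inference. The generative setting described above removes that list and asks the model to produce the name itself.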

Paper Structure (23 sections, 3 equations, 7 figures, 6 tables)

This paper contains 23 sections, 3 equations, 7 figures, 6 tables.

Introduction
Related Work
Open-Vocabulary Object Detection
Multimodal Large Language Model
Dense Captioning
Method
Open-World Object Detection
Transferring from a Frozen Multimodal LLM
Generative Region-Language Pretraining
Enrich Label Diversity
Experiments
Datasets
Evaluation protocol and metrics
Implementation Details
Generative Open-Ended Detection Results
...and 8 more sections

Figures (7)

Figure 1: Comparing generative open-ended object detection with other open-set object detection tasks. Open-vocabulary object detection and phrase grounding typically require predefined categories or phrases in text prompts to align with image regions. In contrast, our introduced generative open-ended object detection is a more general and practical setting where categorical information is not explicitly defined. Such a setting is especially meaningful for scenarios where users lack precise knowledge of object categories during inference.
Figure 2: Overview of our proposed open-ended object detection model, GenerateU, which comprises two major components: an object detector and a language model. We compare two training strategies: (Top) We incorporate the class-agnostic DETR (with frozen image encoder) into a pre-trained and frozen Multimodal Large Language Model (including Adaptor and Language Model), to facilitate a smooth transfer of knowledge from the language domain to object detection; (Bottom) We activate the image encoder and language model as trainable components, taking an end-to-end approach to seamlessly integrate region-level understanding into the language model.
Figure 3: Selected pseudo-label examples, showing generated bounding boxes that cover nearly all objects in the images. The varied set of descriptive labels showcases the model's capability to produce a diverse and linguistically rich vocabulary. In the figure, a white underline indicates that the pseudo-label comes from the noun phrase (black underline) in the caption.
Figure 4: Effect of beam size. Beam search plays a crucial role in generating rare object names, effectively addressing the long-tail problem.
Figure 5: Qualitative prediction results from GenerateU and ground truth on LVIS (Gupta et al., 2019). GenerateU produces complete and precise predictions, showcasing its ability to go beyond fixed vocabulary constraints.
...and 2 more figures
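
The Figure 4 caption credits beam search with helping generate rare object names. The sketch below illustrates the general idea with a simplified beam search over a toy next-token table. Everything in it is an assumption made for illustration: `TOY_MODEL`, its probabilities, and the `beam_search` helper are hypothetical and do not reflect GenerateU's actual decoder or settings.

```python
import math

# Hypothetical next-token probabilities, keyed by the tokens generated so far.
# "<eos>" ends a name. These values are invented for illustration only.
TOY_MODEL = {
    (): {"dog": 0.6, "border": 0.4},
    ("dog",): {"<eos>": 1.0},
    ("border",): {"collie": 0.9, "<eos>": 0.1},
    ("border", "collie"): {"<eos>": 1.0},
}


def beam_search(model, beam_size, max_len=4):
    """Simplified beam search: keep the beam_size best unfinished prefixes
    at each step and collect every hypothesis that reaches <eos>."""
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, logp in beams:
            for token, prob in model[tokens].items():
                new_logp = logp + math.log(prob)
                if token == "<eos>":
                    finished.append((tokens, new_logp))
                else:
                    candidates.append((tokens + (token,), new_logp))
        # Prune to the highest-scoring unfinished prefixes.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return [(" ".join(tokens), math.exp(logp)) for tokens, logp in finished]


for size in (1, 2):
    names = [name for name, _ in beam_search(TOY_MODEL, beam_size=size)]
    print(f"beam_size={size}: {names}")
```

With a beam of one, the toy search commits to the most probable first token and returns only "dog". With a beam of two, the less probable prefix "border" survives pruning, so the more specific name "border collie" also appears among the outputs. This is one plausible reading of why a wider beam helps with long-tail names.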
