MQADet: A Plug-and-Play Paradigm for Enhancing Open-Vocabulary Object Detection via Multimodal Question Answering

Caixiong Li, Xiongwei Zhao, Jinhang Zhang, Xing Zhang, Qihao Sun, Zhou Wu

TL;DR

MQADet tackles open-vocabulary detection by bridging pre-trained detectors with multimodal language models through a three-stage Multimodal Question Answering pipeline. It first extracts target subjects from complex text (TASE), then positions candidate objects guided by the text with an OV detector (TMOP), and finally selects the optimal object via MLLM reasoning (MOOS). The approach is plug-and-play, requiring no substantial retraining, and demonstrates significant accuracy gains across four challenging datasets (RefCOCO, RefCOCO+, RefCOCOg, Ref-L4) when paired with multiple detectors and MLLMs. These results highlight MQADet’s potential to improve fine-grained visual-text alignment in real-world OV detection tasks and its practical applicability with readily available models.

Abstract

Open-vocabulary detection (OVD) is the challenging task of detecting and classifying objects from an unrestricted set of categories, including those unseen during training. Existing open-vocabulary detectors are limited by complex visual-textual misalignment and long-tailed category imbalances, leading to suboptimal performance in challenging scenarios. To address these limitations, we introduce MQADet, a universal paradigm for enhancing existing open-vocabulary detectors by leveraging the cross-modal reasoning capabilities of multimodal large language models (MLLMs). MQADet functions as a plug-and-play solution that integrates seamlessly with pre-trained object detectors without substantial additional training costs. Specifically, we design a novel three-stage Multimodal Question Answering (MQA) pipeline that guides the MLLMs to precisely localize complex textual and visual targets while sharpening the focus of existing object detectors on relevant objects. To validate our approach, we present a new benchmark that evaluates our paradigm on four challenging open-vocabulary datasets, employing three state-of-the-art object detectors as baselines. Experimental results demonstrate that our proposed paradigm significantly improves the performance of existing detectors, particularly on unseen complex categories, across diverse and challenging scenarios. To facilitate future research, we will publicly release our code.
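The three-stage flow described here (TASE, then TMOP, then MOOS) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `mqa_pipeline`, the `Candidate` type, and the three callables are hypothetical names introduced only for illustration, and the `toy_*` functions are trivial stand-ins for a real MLLM and a real open-vocabulary detector.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


@dataclass
class Candidate:
    box: Box
    label: str
    score: float


def mqa_pipeline(
    image: Any,
    query: str,
    extract_subjects: Callable[[str], List[str]],
    detect: Callable[[Any, List[str]], List[Candidate]],
    select: Callable[[Any, str, List[Candidate]], int],
) -> Optional[Candidate]:
    """Chain three hypothetical stages; each callable is supplied by the caller."""
    # Stage 1 (TASE): reduce the complex query to its target subjects.
    subjects = extract_subjects(query)
    # Stage 2 (TMOP): let a detector propose candidate boxes for those subjects.
    candidates = detect(image, subjects)
    if not candidates:
        return None
    # Stage 3 (MOOS): ask a reasoning model which candidate matches the full query.
    index = select(image, query, candidates)
    return candidates[index]


# Toy stand-ins (assumptions, not real models) so the sketch runs end to end.
def toy_extract(query: str) -> List[str]:
    return [word for word in ("bear", "dog") if word in query]


def toy_detect(image: Any, subjects: List[str]) -> List[Candidate]:
    if "bear" not in subjects:
        return []
    return [
        Candidate((0.0, 0.0, 10.0, 10.0), "bear", 0.6),
        Candidate((20.0, 20.0, 40.0, 40.0), "bear", 0.5),
    ]


def toy_select(image: Any, query: str, candidates: List[Candidate]) -> int:
    # A real MOOS stage would reason over image and text; here we take the top score.
    return max(range(len(candidates)), key=lambda i: candidates[i].score)


result = mqa_pipeline(None, "a teddy bear with a checkered foot",
                      toy_extract, toy_detect, toy_select)
```

Passing the stages in as callables mirrors the plug-and-play claim: under this assumed interface, swapping the detector or the MLLM means swapping one argument, with no retraining step in the loop.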

Paper Structure

This paper contains 28 sections, 6 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Example of OVD in challenging scenarios. The detection target in this case is described as "a teddy bear with a checkered design on one foot and a bumble bee design on the other foot . the bear also has the checkered design over its ' ears". Compared with previous OV detectors (e.g., Grounding DINO, YOLO-World, and OmDet-Turbo), MQADet significantly improves detection accuracy for objects described by complex textual queries.
  • Figure 2: An overview of the proposed MQADet paradigm, which consists of three Multimodal Question Answering (MQA) stages: Text-Aware Subject Extraction (TASE), Text-Guided Multimodal Object Positioning (TMOP), and MLLMs-Driven Optimal Object Selection (MOOS).
  • Figure 3: Three representative cases using our MQADet paradigm. The specific inputs to MQADet and the corresponding outputs across the three stages are presented. Our paradigm effectively discerns a wider range of categories and reasons its way to the correct answers.
  • Figure 4: The visualization results of Grounding DINO, YOLO-World, OmDet-Turbo, and MQADet, with GPT-4o employed as the MLLM. Pink words indicate the subjects identified from the user query. Please zoom in to view the detailed labels.
  • Figure 5: Performance comparison of MQADet and detectors on the challenging RefCOCOg and Ref-L4 datasets. GPT-4o serves as the MLLM, and Grounding DINO and YOLO-World serve as the object detectors. The evaluation metric is Acc@0.5. RgTrain, RgVal, RgTest, RL4Val, and RL4Test refer to the RefCOCOg train, RefCOCOg val, RefCOCOg test, Ref-L4 val, and Ref-L4 test splits, respectively.
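
The Acc@0.5 metric named in Figure 5 is not defined on this page. A minimal Python sketch follows, assuming the common referring-expression convention: one predicted box per query, counted as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. The function names, the (x1, y1, x2, y2) box format, and the inclusive threshold are assumptions, not details taken from the paper.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def acc_at_iou(
    predictions: Sequence[Optional[Box]],
    ground_truths: Sequence[Box],
    threshold: float = 0.5,
) -> float:
    """Fraction of queries whose predicted box reaches the IoU threshold.

    A prediction of None (no box returned) is counted as a miss.
    """
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground_truths must have equal length")
    if not ground_truths:
        return 0.0
    hits = sum(
        1
        for pred, gt in zip(predictions, ground_truths)
        if pred is not None and iou(pred, gt) >= threshold
    )
    return hits / len(ground_truths)
```

Under this definition a detector that returns a plausible but wrong instance (for example, the other teddy bear in a cluttered scene) scores zero for that query, which is why the metric is sensitive to the fine-grained text-to-object alignment the paper targets.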