MQADet: A Plug-and-Play Paradigm for Enhancing Open-Vocabulary Object Detection via Multimodal Question Answering
Caixiong Li, Xiongwei Zhao, Jinhang Zhang, Xing Zhang, Qihao Sun, Zhou Wu
TL;DR
MQADet tackles open-vocabulary (OV) detection by bridging pre-trained detectors with multimodal large language models (MLLMs) through a three-stage Multimodal Question Answering pipeline. It first extracts the target subjects from complex text (TASE), then uses an OV detector to localize candidate objects guided by that text (TMOP), and finally selects the optimal object via MLLM reasoning (MOOS). The approach is plug-and-play, requiring no substantial retraining, and demonstrates significant accuracy gains across four challenging datasets (RefCOCO, RefCOCO+, RefCOCOg, Ref-L4) when paired with multiple detectors and MLLMs. These results highlight MQADet’s potential to improve fine-grained visual-text alignment in real-world OV detection tasks and its practical applicability with readily available models.
Abstract
Open-vocabulary detection (OVD) is the challenging task of detecting and classifying objects from an unrestricted set of categories, including those unseen during training. Existing open-vocabulary detectors are limited by complex visual-textual misalignment and long-tailed category imbalances, leading to suboptimal performance in challenging scenarios. To address these limitations, we introduce MQADet, a universal paradigm for enhancing existing open-vocabulary detectors by leveraging the cross-modal reasoning capabilities of multimodal large language models (MLLMs). MQADet functions as a plug-and-play solution that integrates seamlessly with pre-trained object detectors without substantial additional training costs. Specifically, we design a novel three-stage Multimodal Question Answering (MQA) pipeline that guides the MLLMs to precisely localize complex textual and visual targets while effectively enhancing the focus of existing object detectors on relevant objects. To validate our approach, we present a new benchmark for evaluating our paradigm on four challenging open-vocabulary datasets, employing three state-of-the-art object detectors as baselines. Experimental results demonstrate that our proposed paradigm significantly improves the performance of existing detectors, particularly in unseen complex categories, across diverse and challenging scenarios. To facilitate future research, we will publicly release our code.
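To make the control flow of the three-stage pipeline concrete, the following is a minimal Python sketch of how the TASE, TMOP, and MOOS stages named in the TL;DR might be chained. It is an illustration under stated assumptions, not the released implementation: the `ask_mllm` and `detect` callables, the `Candidate` type, the prompt wording, the `max_candidates` cap, and the fallback to the top-scoring box are all hypothetical choices made here for the example.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) convention


@dataclass
class Candidate:
    """One detector output; a hypothetical container for this sketch."""
    box: Box
    label: str
    score: float


# Hypothetical interfaces: any MLLM or OV detector wrapper with these shapes fits.
AskMLLM = Callable[[str, Any], str]                      # (prompt, image or None) -> answer
Detect = Callable[[Any, Sequence[str]], List[Candidate]]  # (image, text labels) -> boxes


def extract_subjects(query: str, ask_mllm: AskMLLM) -> List[str]:
    """Stage 1 (TASE): ask the MLLM for the target subjects of a complex query."""
    prompt = f"List the target subject(s) of this description, comma-separated: {query}"
    answer = ask_mllm(prompt, None)  # text-only question, so no image is passed
    return [s.strip() for s in answer.split(",") if s.strip()]


def position_candidates(image: Any, subjects: Sequence[str], detect: Detect,
                        max_candidates: int = 5) -> List[Candidate]:
    """Stage 2 (TMOP): run the OV detector on the subjects, keep the top boxes."""
    candidates = detect(image, subjects)
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    return ranked[:max_candidates]


def select_optimal(image: Any, query: str, candidates: Sequence[Candidate],
                   ask_mllm: AskMLLM) -> Optional[Candidate]:
    """Stage 3 (MOOS): ask the MLLM which candidate best matches the full query."""
    if not candidates:
        return None
    listing = "\n".join(f"{i}: {c.label} at {c.box}" for i, c in enumerate(candidates))
    prompt = (f"Description: {query}\nCandidates:\n{listing}\n"
              "Reply with the index of the best-matching candidate.")
    answer = ask_mllm(prompt, image)
    match = re.search(r"\d+", answer)
    index = int(match.group()) if match else 0
    # Assumption: fall back to the detector's top-scoring box on an unusable answer.
    if not 0 <= index < len(candidates):
        index = 0
    return candidates[index]


def run_pipeline(image: Any, query: str, ask_mllm: AskMLLM,
                 detect: Detect) -> Optional[Candidate]:
    """Chain the three stages; neither the MLLM nor the detector is retrained."""
    subjects = extract_subjects(query, ask_mllm)
    candidates = position_candidates(image, subjects, detect)
    return select_optimal(image, query, candidates, ask_mllm)
```

Because both models enter only through the two callables, swapping in a different detector or MLLM means writing a new wrapper rather than changing the pipeline, which is one way to read the plug-and-play claim.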
