ChatterBox: Multi-round Multimodal Referring and Grounding
Yunjie Tian, Tianren Ma, Lingxi Xie, Jihao Qiu, Xi Tang, Yuan Zhang, Jianbin Jiao, Qi Tian, Qixiang Ye
TL;DR
This work defines the multimodal multi-round referring and grounding (MRG) task and introduces CB-300K, a benchmark built from Visual Genome with GPT-4-generated dialogues, together with an evaluation metric that combines language accuracy and grounding accuracy. It then presents ChatterBox, a two-branch vision-language model that uses explicit vision modules and a query-guided grounding mechanism, trained with a two-stage optimization over CB-300K and auxiliary data. Empirical results show that ChatterBox outperforms baselines on multi-round MRG and transfers effectively to single-round referring and grounding tasks, which underscores the value of explicit grounding components in multimodal dialogue. The dataset and model together offer a plug-and-play framework for complex, instance-level interactions in multimodal dialogue systems.
Abstract
In this study, we establish a baseline for a new task named multimodal multi-round referring and grounding (MRG), opening up a promising direction for instance-level multimodal dialogues. We present a new benchmark and an efficient vision-language model for this purpose. The new benchmark, named CB-300K, spans challenges including multi-round dialogue, complex spatial relationships among multiple instances, and consistent reasoning, which are beyond those shown in existing benchmarks. The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks. By tokenizing instance regions, the language branch acquires the ability to perceive referential information. Meanwhile, ChatterBox feeds a query embedding in the vision branch to a token receiver for visual grounding. A two-stage optimization strategy is devised, making use of both CB-300K and auxiliary external data to improve the model's stability and capacity for instance-level understanding. Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with complicated and precise interactions. Code, data, and model are available at: https://github.com/sunsmarterjie/ChatterBox.
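The idea of tokenizing instance regions so that a language branch can read referential information can be illustrated with a minimal sketch. The abstract does not specify the token format, so everything below is an assumption made for illustration: the `<box>` tag, the 1000-bin coordinate quantization, and the helper names `box_to_token`, `build_dialogue`, and `iou` are hypothetical and are not taken from the ChatterBox code. The `iou` helper is the standard intersection-over-union measure commonly used to score grounding outputs; it is not claimed to be the metric defined for CB-300K.

```python
from typing import List, Optional, Tuple

# A box is (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def box_to_token(box: Box, width: int, height: int, bins: int = 1000) -> str:
    """Serialize a box as text, with coordinates quantized to [0, bins).

    Hypothetical format: the real model may encode regions differently.
    """
    def quantize(value: float, size: int) -> int:
        return min(bins - 1, max(0, int(value / size * bins)))

    x1, y1, x2, y2 = box
    coords = (
        quantize(x1, width),
        quantize(y1, height),
        quantize(x2, width),
        quantize(y2, height),
    )
    return "<box>[{},{},{},{}]</box>".format(*coords)


def build_dialogue(
    turns: List[Tuple[str, Optional[Box]]], width: int, height: int
) -> str:
    """Render a multi-round dialogue; a turn may refer to a region by its box."""
    lines = []
    for round_id, (question, box) in enumerate(turns, start=1):
        ref = f" {box_to_token(box, width, height)}" if box is not None else ""
        lines.append(f"Round {round_id} | User: {question}{ref}")
    return "\n".join(lines)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 if the union is empty."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    dialogue = build_dialogue(
        [
            ("What is this object?", (0, 0, 320, 240)),  # referring turn
            ("Where is its owner?", None),               # grounding turn
        ],
        width=640,
        height=480,
    )
    print(dialogue)
    # Compare a predicted box against a ground-truth box.
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

In this sketch a referring turn carries its region inline as text, while a grounding turn carries no box and expects the model to return one, which can then be compared against the ground truth with `iou`.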
