ChatterBox: Multi-round Multimodal Referring and Grounding

Yunjie Tian, Tianren Ma, Lingxi Xie, Jihao Qiu, Xi Tang, Yuan Zhang, Jianbin Jiao, Qi Tian, Qixiang Ye

TL;DR

This work defines the multimodal multi-round referring and grounding (MRG) task and introduces CB-300K, a benchmark built from Visual Genome with GPT-4-generated dialogues and a robust evaluation metric that blends language and grounding accuracy. It then presents ChatterBox, a two-branch vision-language model that uses explicit vision modules and a query-guided grounding mechanism, trained via a two-stage optimization over CB-300K and auxiliary data. Empirical results show ChatterBox outperforms baselines on multi-round MRG and transfers effectively to single-round referring and grounding tasks, underscoring the value of explicit grounding components in multimodal dialogue. The dataset and model offer a scalable, plug-and-play framework for complex, instance-level interactions in multimodal AI systems, with implications for advancing multimodal dialogue and AGI-related capabilities.

Abstract

In this study, we establish a baseline for a new task named multimodal multi-round referring and grounding (MRG), opening up a promising direction for instance-level multimodal dialogues. We present a new benchmark and an efficient vision-language model for this purpose. The new benchmark, named CB-300K, spans challenges including multi-round dialogue, complex spatial relationships among multiple instances, and consistent reasoning, which are beyond those shown in existing benchmarks. The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks. By tokenizing instance regions, the language branch acquires the ability to perceive referential information. Meanwhile, ChatterBox feeds a query embedding in the vision branch to a token receiver for visual grounding. A two-stage optimization strategy is devised, making use of both CB-300K and auxiliary external data to improve the model's stability and capacity for instance-level understanding. Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with complicated and precise interactions. Code, data, and model are available at: https://github.com/sunsmarterjie/ChatterBox.

Paper Structure (26 sections, 2 equations, 9 figures, 6 tables)

This paper contains 26 sections, 2 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: A showcase of the multi-round referring and grounding (MRG) task. During the dialogue, the agent can receive either a [REF] token for referring expressions or a [GND] token for visual grounding; without these tokens, the task becomes generic visual question answering. All the answers are generated by the ChatterBox agent, demonstrating its strong ability in visual recognition. In particular, ChatterBox can understand logically related questions and incorporate contextual information to provide answers. For instance, in the right-hand thread, the question 'Where is the other one?' requires the agent to recognize that 'one' refers to a person and then locate the 'other' person, distinct from the one mentioned earlier.
  • Figure 2: The CB-300K data contains four subsets for different purposes. The images and metadata (object locations and descriptions) are inherited from Visual Genome. The same image can appear in different subsets. The first two subsets, CB-MRG and CB-LC, are obtained by prompting GPT-4 to read the metadata and generate questions and answers. The last two subsets, CB-REF and CB-GND, are produced using manually designed rules and then polished by GPT-3.5. This figure is best viewed in color.
  • Figure 3: The architecture of the ChatterBox model. It receives the image and the current question with dialogue history as input, and produces language output and, if necessary, vision output (i.e., visual grounding results). The location decoder is magnified to illustrate the interaction between the query token and visual features. This figure is best viewed in color.
  • Figure 4: A qualitative comparison of multi-round dialogue between LISA (Lai et al., 2023), Kosmos-2 (Peng et al., 2023), and ChatterBox (ours). Our model demonstrates a superior ability to understand multi-round dialogues and perform reasoning (please also refer to the examples in Figure 1). We stress that the stronger visual recognition ability comes from the explicit vision modules.
  • Figure 5: The architectural visualization of target-guided query.
  • ...and 4 more figures
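
The Figure 1 and Figure 3 captions describe how a turn is routed by its special token and how the current question is combined with the dialogue history. A minimal Python sketch of that idea follows. It is an illustration only, not the authors' implementation: the names `Turn`, `task_type`, and `build_prompt`, the `USER:`/`ASSISTANT:` prompt layout, and the decision to reject a question carrying both tokens are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List

# Special tokens named in the Figure 1 caption.
REF_TOKEN = "[REF]"
GND_TOKEN = "[GND]"


@dataclass
class Turn:
    """One completed round of dialogue (hypothetical container)."""
    question: str
    answer: str


def task_type(question: str) -> str:
    """Classify a question by the special token it carries.

    [REF] -> referring expression, [GND] -> visual grounding,
    neither -> generic visual question answering.
    """
    has_ref = REF_TOKEN in question
    has_gnd = GND_TOKEN in question
    if has_ref and has_gnd:
        # Assumption: the caption says "either", so both at once is rejected.
        raise ValueError("question carries both [REF] and [GND]")
    if has_ref:
        return "referring"
    if has_gnd:
        return "grounding"
    return "vqa"


def build_prompt(history: List[Turn], question: str) -> str:
    """Join earlier rounds and the current question into one text prompt.

    The USER/ASSISTANT layout is a placeholder, not the paper's template.
    """
    lines = []
    for turn in history:
        lines.append(f"USER: {turn.question}")
        lines.append(f"ASSISTANT: {turn.answer}")
    lines.append(f"USER: {question}")
    return "\n".join(lines)
```

Keeping the history in the prompt is what lets a follow-up such as 'Where is the other one?' be resolved against the person mentioned in an earlier round; in this sketch, `task_type` would mark that question as grounding only if it carries the [GND] token.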