Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension

Juexi Shao, Siyou Li, Yujian Gan, Chris Madge, Vanja Karan, Massimo Poesio

TL;DR

This work tackles data scarcity and distribution shift in Generalized Referring Expression Comprehension (GREC) by introducing a three-tier data synthesis framework: template-based short expressions, prompted single utterances, and full multi-turn dialogues with coreference. The authors fine-tune a Qwen2-VL model with LoRA on the synthesized data and benchmark it on MDC-R, reporting substantial improvements over baselines while also revealing biases introduced by heterogeneous data. The approach offers scalable supervision for dialogue-conditioned grounding and suggests that distribution-aware training can improve generalization across domains. The methodology and findings may apply to other vision-language tasks that require complex grounding and memory across a dialogue.

Abstract

Dialogue-Based Generalized Referring Expression Comprehension (GREC) requires models to ground an expression that may refer to an arbitrary number of targets in complex visual scenes, while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data-synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.

Paper Structure

This paper contains 13 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The MC environment: an example image for the synthetic expression 'the second green block from the top'.
  • Figure 2: Method for generating multi-turn dialogues containing a coreference chain.
  • Figure 3: A full dialogue (top) with the coreference chain highlighted in green, and inference results (bottom) of Qwen2-VL under various settings. The green bounding box represents the ground truth, while the red bounding box represents the prediction.