Making Dialogue Grounding Data Rich: A Three-Tier Data Synthesis Framework for Generalized Referring Expression Comprehension
Juexi Shao, Siyou Li, Yujian Gan, Chris Madge, Vanja Karan, Massimo Poesio
TL;DR
This work tackles data scarcity and distribution shift in Generalized Referring Expression Comprehension (GREC) with a three-tier data synthesis framework that ranges from template-based short expressions, through prompted single utterances, to full multi-turn dialogues with coreference. The authors fine-tune Qwen2-VL with LoRA on the synthesized data and benchmark it on MDC-R, reporting substantial improvements over baselines while also revealing biases introduced by heterogeneous data. Their approach offers scalable supervision for dialogue-conditioned grounding and suggests that distribution-aware training can improve generalization across domains. The methodology and findings apply broadly to vision-language tasks that require complex grounding and memory across dialogue turns.
Abstract
Dialogue-based Generalized Referring Expression Comprehension (GREC) requires models to ground expressions that may refer to an arbitrary number of targets in complex visual scenes, while resolving coreference across a long dialogue context. However, existing systems struggle under distribution shift between training and evaluation domains, a gap exacerbated by the scarcity of annotated dialogue grounding data. We address this challenge with a three-tier data synthesis method that balances realism and controllability to produce scalable supervision for dialogue-conditioned grounding. Fine-tuning on the synthesized data yields consistent, substantial improvements over prior approaches across standard evaluation metrics.
