Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding
Bram Willemsen, Gabriel Skantze
TL;DR
The paper tackles referring expression generation (REG) in visually grounded dialogue with a two-stage framework. First, a multimodal generator (IDEFICS) produces contextually appropriate referring expressions (REs) conditioned on the dialogue history and an image of the referent. Second, a discourse-aware comprehension-guiding step (CRDG) reranks the candidate REs to maximize their discriminative power within the dialogue. Discriminative power is quantified by combining TIM and ITM scores from a pretrained discriminative VLM into a pooled score $S_i = w_{a_i} \cdot \ln(a_i + \varepsilon) + w_{b_i} \cdot \ln(b_i + \varepsilon)$ with weights $w_{a_i} = \frac{2}{3}$ and $w_{b_i} = \frac{1}{3}$; the candidate with the highest $S_i$ is selected. The approach is validated on the AGOS dataset: in a human evaluation, reranked REs yield higher text-image retrieval accuracy than REs generated with greedy decoding. The work highlights the value of discourse-aware evaluation in REG and provides LoRA-tuned weights and materials for reproducibility. Limitations include language scope, dataset size, and reliance on a closed-source CRDG setup, which suggests multilingual and larger-scale multimodal studies as future directions.
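The pooled score and the final selection step can be sketched in a few lines of Python. This is a minimal illustration of the formula above, not the authors' implementation: the function names, the candidate tuples, the example scores, and the value of $\varepsilon$ are all assumptions, and the summary does not say which of the two VLM scores corresponds to $a_i$ and which to $b_i$.

```python
import math

# Assumed smoothing constant; the summary does not give a value for epsilon.
EPS = 1e-8


def pooled_score(a, b, w_a=2 / 3, w_b=1 / 3, eps=EPS):
    """Weighted sum of log-scores: S = w_a * ln(a + eps) + w_b * ln(b + eps)."""
    return w_a * math.log(a + eps) + w_b * math.log(b + eps)


def select_best(candidates):
    """Return the (text, a, b) candidate with the highest pooled score.

    `candidates` is a hypothetical list of tuples holding a candidate RE
    and its two scores from the discriminative VLM.
    """
    return max(candidates, key=lambda c: pooled_score(c[1], c[2]))


# Made-up scores, for illustration only.
candidates = [
    ("the dog", 0.5, 0.5),
    ("the brown dog on the left", 0.9, 0.8),
]
best_text, _, _ = select_best(candidates)
print(best_text)
```

Because the scores are combined in log space, a candidate that scores close to zero on either component is penalized heavily, and $\varepsilon$ keeps the logarithm defined when a score is exactly zero.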
Abstract
We propose an approach to referring expression generation (REG) in visually grounded dialogue that aims to produce referring expressions (REs) that are both discriminative and discourse-appropriate. Our method is a two-stage process. First, we model REG as a text- and image-conditioned next-token prediction task: REs are generated autoregressively from their preceding linguistic context and a visual representation of the referent. Second, we propose discourse-aware comprehension guiding as part of a generate-and-rerank strategy, in which candidate REs generated with our REG model are reranked by their discourse-dependent discriminatory power. Results from our human evaluation indicate that the proposed two-stage approach is effective in producing discriminative REs: reranked REs achieve higher text-image retrieval accuracy than REs generated using greedy decoding.
