Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding

Bram Willemsen; Gabriel Skantze

TL;DR

The paper addresses referring expression generation (REG) in visually grounded dialogue with a two-stage framework. First, a multimodal generator (IDEFICS) produces candidate REs conditioned on the dialogue history and an image of the referent. Second, a discourse-aware comprehension-guiding step reranks the candidates by their discriminative power within the dialogue, using referent descriptions generated by the CRDG. Discriminative power is quantified by combining TIM and ITM scores from a pretrained discriminative VLM into a composite score $S_i = w_{a_i} \cdot \ln(a_i + \varepsilon) + w_{b_i} \cdot \ln(b_i + \varepsilon)$ with weights $w_{a_i} = \frac{2}{3}$ and $w_{b_i} = \frac{1}{3}$, where $a_i$ and $b_i$ denote the two matching scores for candidate $i$; the candidate with the highest $S_i$ is selected. The approach is evaluated on the AGOS dataset: in a human evaluation, reranked REs yield higher text-image retrieval accuracy than REs generated with greedy decoding. The authors release LoRA-tuned weights and materials for reproducibility. Stated limitations include language scope, dataset size, and reliance on a closed-source CRDG setup, which points to multilingual and larger-scale multimodal studies as future work.
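The composite score in the TL;DR can be illustrated with a minimal Python sketch. It assumes that $a_i$ and $b_i$ are matching scores in the range (0, 1]; the summary does not state which of the two is the TIM score and which the ITM score, nor the value of $\varepsilon$, so `EPS`, the function names, and the example candidates below are all hypothetical.

```python
import math

# Weights taken from the TL;DR; EPS is an assumed value (not given in the source).
W_A = 2 / 3
W_B = 1 / 3
EPS = 1e-9


def pooled_score(a: float, b: float, w_a: float = W_A, w_b: float = W_B,
                 eps: float = EPS) -> float:
    """Weighted sum of log scores: S = w_a * ln(a + eps) + w_b * ln(b + eps)."""
    return w_a * math.log(a + eps) + w_b * math.log(b + eps)


def select_best(candidates: list[tuple[str, float, float]]) -> str:
    """Return the text of the candidate with the highest pooled score.

    Each candidate is a (text, a, b) triple; the triple layout is hypothetical.
    """
    best = max(candidates, key=lambda c: pooled_score(c[1], c[2]))
    return best[0]


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    demo = [("the dog", 0.2, 0.9), ("the small brown dog", 0.8, 0.6)]
    print(select_best(demo))
```

Because the scores are combined in log space, a candidate that does poorly on the more heavily weighted score is penalized more than one that does equally poorly on the other score.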

Abstract

We propose an approach to referring expression generation (REG) in visually grounded dialogue that is meant to produce referring expressions (REs) that are both discriminative and discourse-appropriate. Our method constitutes a two-stage process. First, we model REG as a text- and image-conditioned next-token prediction task. REs are autoregressively generated based on their preceding linguistic context and a visual representation of the referent. Second, we propose the use of discourse-aware comprehension guiding as part of a generate-and-rerank strategy through which candidate REs generated with our REG model are reranked based on their discourse-dependent discriminatory power. Results from our human evaluation indicate that our proposed two-stage approach is effective in producing discriminative REs, with higher performance in terms of text-image retrieval accuracy for reranked REs compared to those generated using greedy decoding.
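The generate-and-rerank strategy described in the abstract, broken down into the four steps listed in the Figure 2 caption, can be sketched as a small orchestration function. This is only an illustration under stated assumptions: `generate`, `describe`, and `score` are hypothetical stand-ins for the REG model, the CRDG, and the VLM-based scorer, and appending the candidate to the preceding context is a simplification of inserting it into the dialogue segment.

```python
from typing import Callable, Sequence


def generate_and_rerank(
    context: str,
    generate: Callable[[str], Sequence[str]],
    describe: Callable[[str], str],
    score: Callable[[str], float],
) -> str:
    """Pick the candidate RE whose referent description scores highest.

    All three callables are hypothetical placeholders, not the authors' API.
    """
    # Step 1: generate candidate REs from the preceding linguistic context.
    candidates = generate(context)
    if not candidates:
        raise ValueError("generator returned no candidates")

    scored = []
    for candidate in candidates:
        # Step 2: place the candidate where it was generated (here: appended
        # to the context) and derive a referent description from the segment.
        segment = f"{context} {candidate}"
        description = describe(segment)
        # Steps 3-4: score the description; weighting is left to `score`.
        scored.append((score(description), candidate))

    # Select the highest-scoring candidate (first one wins on ties).
    return max(scored, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    picked = generate_and_rerank(
        "A: which one do you mean? B: I mean",
        generate=lambda ctx: ["the dog", "the small brown dog"],
        describe=lambda seg: seg,
        score=lambda desc: float(len(desc)),
    )
    print(picked)
```

Passing the components as callables keeps the sketch independent of any particular model; in practice each stand-in would wrap a model call.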

Paper Structure (21 sections, 1 equation, 8 figures, 7 tables)

Introduction
Related work
Method
Task description
Proposed approach
Multimodal conditioning with IDEFICS
Comprehension guiding with the CRDG
Experiments
Data
Evaluation
Metrics
Human
Comparisons
Implementation details
Results
...and 6 more sections

Figures (8)

Figure 1: Excerpt (simplified) taken from a dialogue collected by Willemsen et al. (2022).
Figure 2: Visualization of the proposed two-stage, four-step framework. The first stage concerns (1) the autoregressive generation of candidate REs where the input to the REG model is the preceding linguistic context of the RE and an image representing the referent. In the second stage, candidate REs are (2) inserted into the dialogue segment at the point at which they were generated, after which the segment is processed by the CRDG (Willemsen et al., 2023) to generate referent descriptions. These referent descriptions are (3) used to evaluate the discourse-dependent discriminatory power of the candidate REs by using a pretrained VLM to produce TIM and ITM scores, which are then (4) weighted to arrive at a composite score for each candidate RE; the highest-scoring candidate RE is selected.
Figure 3: Images of dogs for the reranking example in the appendix, illustrating the rationale behind weighted reranking.
Figure 4: Average RE length per round. Shown are ground truth REs taken from the dialogues (blue), REs generated by the fine-tuned IDEFICS model using greedy decoding (orange), and REs selected based on our weighted reranking (green). Error bars indicate 95% bootstrapped confidence intervals.
Figure 5: Example of an item shown to participants during the human evaluation study.
...and 3 more figures
