LVLMs are Bad at Overhearing Human Referential Communication
Zhengxiang Wang, Weiling Li, Panagiotis Kaliosis, Owen Rambow, Susan E. Brennan
TL;DR
The paper investigates how well large vision-language models (LVLMs) can act as overhearers of spontaneous referential communication, a setting in which a model must ground language in vision without taking part in the dialogue. Using a corpus of 80 human-human dialogues spanning repeated object-matching rounds, the authors evaluate seven LVLMs (proprietary and open-weight) on an overhearer task, analyzing performance within single rounds, across rounds, and under input variations. Even state-of-the-art LVLMs struggle to ground naturalistic referring expressions and fail to improve with repeated overhearing, unlike humans, who rapidly gain efficiency and accuracy as common ground builds up. The authors release the corpus and code for reproducibility and future research. They also identify concrete factors that can boost LVLM grounding, such as leveraging object-level descriptions, while underscoring how poorly current models accumulate knowledge across interaction histories. Overall, the study highlights gaps in LVLMs’ ability to adapt to dynamic communicative conventions and provides a benchmark for future work on multimodal grounding and interaction.
Abstract
During spontaneous conversations, speakers collaborate on novel referring expressions, which they can then re-use in subsequent conversations. Understanding such referring expressions is an important ability for an embodied agent, so that it can carry out tasks in the real world. This requires integrating and understanding language, vision, and conversational interaction. We study the capabilities of seven state-of-the-art Large Vision Language Models (LVLMs) as overhearers to a corpus of spontaneous conversations between pairs of human discourse participants engaged in a collaborative object-matching task. We find that such a task remains challenging for current LVLMs, and that they all fail to show a consistent performance improvement as they overhear more conversations from the same discourse participants repeating the same task over multiple rounds. We release our corpus and code for reproducibility and to facilitate future research.
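To make the across-round claim concrete, the following is a minimal sketch of how per-round overhearer accuracy could be tabulated and checked for consistent improvement. It is not the released evaluation code: the record layout (round number, predicted object, gold object), the function names, and the toy data are all assumptions made for illustration, and "consistent improvement" is read here as accuracy that never drops between rounds and ends higher than it starts.

```python
from collections import defaultdict


def accuracy_by_round(records):
    """Compute matching accuracy per round.

    `records` is a hypothetical iterable of
    (round_number, predicted_object, gold_object) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for round_number, predicted, gold in records:
        total[round_number] += 1
        correct[round_number] += int(predicted == gold)
    return {r: correct[r] / total[r] for r in sorted(total)}


def improves_consistently(per_round):
    """Return True if accuracy never drops between consecutive rounds
    and the last round is strictly better than the first."""
    scores = [per_round[r] for r in sorted(per_round)]
    never_drops = all(b >= a for a, b in zip(scores, scores[1:]))
    return never_drops and scores[-1] > scores[0]


# Toy data (invented for illustration, not taken from the corpus).
toy_records = [
    (1, "A", "A"), (1, "B", "C"),
    (2, "A", "A"), (2, "C", "C"),
    (3, "B", "A"), (3, "C", "C"),
]

per_round = accuracy_by_round(toy_records)
print(per_round)
print(improves_consistently(per_round))
```

On the toy data, accuracy rises from round 1 to round 2 and then falls in round 3, so the check reports no consistent improvement. Stricter or looser definitions of improvement (for example, a trend test over rounds) would be equally reasonable choices.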
