LVLMs are Bad at Overhearing Human Referential Communication

Zhengxiang Wang, Weiling Li, Panagiotis Kaliosis, Owen Rambow, Susan E. Brennan

TL;DR

The paper investigates how well large vision-language models (LVLMs) can serve as overhearers in spontaneous referential communication, a setting in which a model must ground language to vision without participating in the dialogue. Using a corpus of 80 human-human dialogues across repeated object-matching rounds, the authors evaluate seven LVLMs (proprietary and open-weight) on an overhearer task and analyze single-round performance, performance across rounds, and robustness to input variations. Even state-of-the-art LVLMs struggle to ground naturalistic referring expressions and fail to improve with repeated overhearing, unlike humans, who rapidly gain efficiency and accuracy as they build common ground. The authors release the corpus and code for reproducibility and future research, and identify concrete factors, such as leveraging object-level descriptions, that can boost LVLM grounding, while underscoring the limitations of current models in accumulating knowledge across interaction histories. Overall, the study highlights gaps in LVLMs’ ability to adapt to dynamic communicative conventions and provides a benchmark for future work on multimodal grounding and interaction.

Abstract

During spontaneous conversations, speakers collaborate on novel referring expressions, which they can then re-use in subsequent conversations. Understanding such referring expressions is an important ability for an embodied agent, so that it can carry out tasks in the real world. This requires integrating and understanding language, vision, and conversational interaction. We study the capabilities of seven state-of-the-art Large Vision Language Models (LVLMs) as overhearers to a corpus of spontaneous conversations between pairs of human discourse participants engaged in a collaborative object-matching task. We find that such a task remains challenging for current LVLMs and they all fail to show a consistent performance improvement as they overhear more conversations from the same discourse participants repeating the same task for multiple rounds. We release our corpus and code for reproducibility and to facilitate future research.

Paper Structure

This paper contains 57 sections, 13 figures, 11 tables.

Figures (13)

  • Figure 1: Our overhearer matching task (after Schober and Clark, 1989): the AI agent (LVLM) reads a transcript from a human referential communication corpus and tries to match the same cards as the matcher to the director's target sequence.
  • Figure 2: The left panel shows one target basket; the middle panel shows one pair's corresponding dialogue from Round 1 to Round 4, demonstrating entrainment on more concise language (for the perspective "rectangular-shaped"). Here, entrainment occurs after they consider multiple proposals in Round 1. The right panel depicts the mean word count (a measure of efficiency) for baskets and dogs across rounds. Error bars indicate $\pm$1 standard error of the mean across pairs.
  • Figure 3: Average accuracy of various LVLMs in the overhearer task over rounds. There are 4 overhearing starting points from Round 1 to Round 4, yielding three lines and one single point. The shaded areas and error bars denote 95% confidence intervals. In this corpus, all human matchers' performance is 100% at every round.
  • Figure 4: Accuracy boxplots of two best-performing LVLMs in the overhearer task for Round 1 conversations across 10 human pairs (whiskers denote 25th and 75th percentiles). Each boxplot represents 30 runs of a model, each with a different object ordering.
  • Figure 5: The 13 basket pictures from our corpus and an example input image for our experiments. The 10 target baskets are placed in the first two rows, numbered from 1 to 10, for illustration. Speakers in the original task did not see the numbers.
  • ...and 8 more figures
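
The Figure 2 caption reports mean word counts per round with error bars of ±1 standard error of the mean across pairs. A minimal sketch of that summary statistic follows; the function name `mean_and_sem` and the word counts are hypothetical placeholders for illustration, not the authors' code or data from the corpus.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for per-pair values."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate a standard error")
    mean = statistics.fmean(values)
    # SEM = sample standard deviation / sqrt(number of pairs)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Hypothetical per-pair word counts for one object, keyed by round
# (made-up numbers, chosen only to illustrate the calculation).
word_counts_by_round = {
    1: [40.0, 50.0, 60.0],
    2: [20.0, 25.0, 30.0],
}

summary = {rnd: mean_and_sem(vals) for rnd, vals in word_counts_by_round.items()}

for rnd, (mean, sem) in summary.items():
    print(f"Round {rnd}: mean = {mean:.1f}, SEM = {sem:.2f}")
```

A falling mean across rounds is what the caption calls a gain in efficiency; the SEM only indicates how much that mean varies across pairs.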