Context informs pragmatic interpretation in vision-language models

Alvin Wei Ming Tan, Ben Prystawski, Veronica Boyce, Michael C. Frank

TL;DR

This paper addresses how context informs pragmatic interpretation in vision-language models during multi-turn reference tasks. It employs iterated reference games with tangram stimuli to probe open-weight models and humans under eight context conditions. Key findings show that while models struggle without context, exposure to same-game, relevant prompts enables rapid improvement toward human-like accuracy around $0.8$, though calibration remains imperfect. The work highlights the potential of context-dependent learning in vision-language models and suggests directions for improving prompt design, attention mechanisms, and generation capabilities to achieve robust, pragmatic language understanding.

Abstract

Iterated reference games, in which players repeatedly pick out novel referents using language, present a test case for agents' ability to perform context-sensitive pragmatic reasoning in multi-turn linguistic environments. We tested humans and vision-language models on trials from iterated reference games, varying the given context in terms of amount, order, and relevance. Without relevant context, models were above chance but substantially worse than humans. However, with relevant context, model performance increased dramatically over trials. Few-shot reference games with abstract referents remain a difficult task for machine learning models.

Paper Structure

This paper contains 18 sections, 11 figures, 2 tables.

Figures (11)

  • Figure 1: A. Overview of the experimental structure for the original interactive games in Boyce et al. (2024). B. User interface for each trial in the experiments with naïve participants.
  • Figure 2: Matcher accuracy across all conditions and matcher types (both human and model), with best-fit LOESS curves, shown by repetition number as seen by the matcher, except for the no context condition where repetition number is from the original game. Error bars indicate bootstrapped 95% confidence intervals. Dashed lines indicate the chance level (0.083).
  • Figure 3: Comparison between naïve human accuracy and model accuracy, with best-fit linear regressions. Shaded regions indicate bootstrapped 95% confidence intervals. Dashed lines indicate perfect calibration ($y = x$).
  • Figure 4: The image that was presented to the vision-language models, containing all 12 tangram shapes with their letter labels.
  • Figure 5: Matcher accuracy for the shuffled, backward, random, and no context conditions across all matcher types (both human and model), with best-fit LOESS curves, shown by repetition number from the original game. Error bars indicate bootstrapped 95% confidence intervals. Dashed lines indicate the chance level (0.083).
  • ...and 6 more figures