Context informs pragmatic interpretation in vision-language models
Alvin Wei Ming Tan, Ben Prystawski, Veronica Boyce, Michael C. Frank
TL;DR
This paper examines how context informs pragmatic interpretation in vision-language models during multi-turn reference tasks. It uses iterated reference games with tangram stimuli to compare open-weight models and humans under eight context conditions. Without context, models struggle; with relevant context from the same game, they improve rapidly toward human-like accuracy of around $0.8$, though their calibration remains imperfect. The work highlights the potential of context-dependent learning in vision-language models and suggests directions for improving prompt design, attention mechanisms, and generation capabilities in pursuit of robust pragmatic language understanding.
Abstract
Iterated reference games, in which players repeatedly pick out novel referents using language, present a test case for agents' ability to perform context-sensitive pragmatic reasoning in multi-turn linguistic environments. We tested humans and vision-language models on trials from iterated reference games, varying the given context in terms of amount, order, and relevance. Without relevant context, models were above chance but substantially worse than humans. However, with relevant context, model performance increased dramatically over trials. Few-shot reference games with abstract referents remain a difficult task for machine learning models.
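To make the three context manipulations concrete (amount, order, and relevance), the following is a minimal illustrative sketch in Python of how earlier trials might be selected as context for a test trial. It is not the authors' implementation: the `Trial` record, the `build_context` function, its parameters, and the order labels `"forward"`, `"backward"`, and `"shuffled"` are all hypothetical names introduced here, and the eight conditions used in the paper are not reproduced.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One reference-game trial (hypothetical record layout)."""
    game_id: str      # game the trial was drawn from
    trial_index: int  # position of the trial within its game
    description: str  # the speaker's utterance
    target: str       # label of the correct referent


def build_context(history, game_id, amount, order="forward", relevant=True, seed=0):
    """Pick earlier trials to present before a test trial.

    relevant=True keeps trials from the same game as the test trial;
    relevant=False keeps trials from other games instead.
    `amount` caps how many trials are kept, and `order` controls how
    the kept trials are arranged. All of this is an assumed design.
    """
    # Relevance: same-game trials versus trials from other games.
    pool = [t for t in history if (t.game_id == game_id) == relevant]
    pool.sort(key=lambda t: (t.game_id, t.trial_index))

    # Amount: keep at most `amount` trials.
    selected = pool[:amount]

    # Order: arrange the kept trials.
    if order == "backward":
        selected.reverse()
    elif order == "shuffled":
        random.Random(seed).shuffle(selected)
    elif order != "forward":
        raise ValueError(f"unknown order: {order!r}")
    return selected
```

For example, `build_context(history, "g1", amount=2, order="backward")` would return the first two trials of game `g1` in reverse order, while passing `relevant=False` would draw the context from other games instead.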
