Common Objects Out of Context (COOCo): Investigating Multimodal Context and Semantic Scene Violations in Referential Communication
Filippo Merlo, Ece Takmaz, Wenkai Chen, Albert Gatt
TL;DR
The paper investigates how Vision-Language Models (VLMs) use scene context when naming objects. It introduces COOCo, a dataset with graded object–scene relatedness and controlled visual noise. Evaluating multiple state-of-the-art VLMs, with more detailed analyses on LLaVA-OneVision, it shows that models rely on scene semantics adaptively: context can distract when targets are incongruent with the scene but facilitates recognition when targets are congruent, and performance degrades as semantic relatedness weakens. An attention analysis shows that successful naming correlates with increased mid-layer attention to the target and that attention relates non-monotonically to scene fit, suggesting a dynamic balance between local and contextual processing. The findings are replicated on VISIONS, and the analysis offers interpretable insights into cross-modal attention, with implications for making referring expression generation (REG) more robust to semantic violations and noise.
Abstract
To what degree and under what conditions do VLMs rely on scene context when generating references to objects? To address this question, we introduce the Common Objects Out-of-Context (COOCo) dataset and conduct experiments on several VLMs under different degrees of scene-object congruency and noise. We find that models leverage scene context adaptively, depending on scene-object semantic relatedness and noise level. Based on these consistent trends across models, we turn to the question of how VLM attention patterns change as a function of target-scene semantic fit, and to what degree these patterns are predictive of categorisation accuracy. We find that successful object categorisation is associated with increased mid-layer attention to the target. We also find a non-monotonic dependency on semantic fit, with attention dropping at moderate fit and increasing for both low and high fit. These results suggest that VLMs dynamically balance local and contextual information for reference generation. Dataset and code are available here: https://github.com/cs-nlp-uu/scenereg
