Common Objects Out of Context (COOCo): Investigating Multimodal Context and Semantic Scene Violations in Referential Communication

Filippo Merlo, Ece Takmaz, Wenkai Chen, Albert Gatt

TL;DR

The paper addresses how Vision-Language Models leverage scene context when naming objects, proposing COOCo, a dataset with graded object–scene relatedness and controlled visual noise. By evaluating multiple state-of-the-art VLMs and conducting in-depth analyses on LLaVA-OneVision, it shows that models adaptively rely on scene semantics: context can distract when targets are incongruent but facilitate recognition when targets align with the scene, and performance degrades as semantic relatedness weakens. An in-depth attention analysis reveals that successful naming correlates with increased mid-layer attention to targets and a non-monotonic relation between scene fit and attention, suggesting a dynamic balance between local and contextual processing. The work is validated through replication on VISIONS and provides interpretable insights into cross-modal attention, with implications for improving REG robustness under semantic violations and noise.

Abstract

To what degree and under what conditions do VLMs rely on scene context when generating references to objects? To address this question, we introduce the Common Objects Out-of-Context (COOCo) dataset and conduct experiments on several VLMs under different degrees of scene-object congruency and noise. We find that models leverage scene context adaptively, depending on scene-object semantic relatedness and noise level. Based on these consistent trends across models, we turn to the question of how VLM attention patterns change as a function of target-scene semantic fit, and to what degree these patterns are predictive of categorisation accuracy. We find that successful object categorisation is associated with increased mid-layer attention to the target. We also find a non-monotonic dependency on semantic fit, with attention dropping at moderate fit and increasing for both low and high fit. These results suggest that VLMs dynamically balance local and contextual information for reference generation. Dataset and code are available here: https://github.com/cs-nlp-uu/scenereg.
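To make the notion of "attention to the target" concrete, the following is a minimal sketch of one way such a quantity could be measured: the share of a layer's image attention that falls on patches inside the target region. It is an illustration under stated assumptions, not the authors' analysis code. The function name `target_attention_share`, the `(num_layers, num_patches)` array layout, and the toy numbers are all hypothetical.

```python
import numpy as np

def target_attention_share(attn, target_mask):
    """Per-layer fraction of attention mass that falls on target patches.

    attn: array of shape (num_layers, num_patches) holding non-negative
          attention weights over image patches (hypothetical layout).
    target_mask: boolean array of shape (num_patches,), True for patches
          that lie inside the target object's region.
    """
    attn = np.asarray(attn, dtype=float)
    target_mask = np.asarray(target_mask, dtype=bool)
    total = attn.sum(axis=1)                       # attention mass per layer
    on_target = attn[:, target_mask].sum(axis=1)   # mass on target patches
    # Layers with no image attention get a share of 0 instead of NaN.
    return np.divide(on_target, total, out=np.zeros_like(total), where=total > 0)

# Toy input (made up): 3 layers, 4 patches, target covers patches 1 and 2.
attn = np.array([[0.25, 0.25, 0.25, 0.25],
                 [0.10, 0.40, 0.40, 0.10],
                 [0.40, 0.10, 0.10, 0.40]])
mask = np.array([False, True, True, False])
shares = target_attention_share(attn, mask)
```

Comparing such per-layer shares between correctly and incorrectly named targets is one way to check whether a mid-layer increase is present. The paper's actual aggregation over heads, tokens, and layers may differ.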

Paper Structure

This paper contains 47 sections, 34 figures, 7 tables.

Figures (34)

  • Figure 2: A set of COOCo images from the "cubicle office" scene category with target object laptop. The target is removed from the original image ('clean') and replaced with objects of the same type ('generated'), as well as with targets of high, medium, and low relatedness to the scene.
  • Figure 3: RefCLIPScores per model at noise level 0 across relatedness conditions. Models perform best in the original and same target conditions, with performance declining as target relatedness decreases.
  • Figure 4: RefCLIPScore and Accuracy across experimental conditions, aggregated over all models. When differences within a condition are minimal, the leftmost point corresponds to the higher value, followed by lower values to the right.
  • Figure 5: Semantic similarity between the model’s outputs and scene labels for LLaVA-OneVision-0.5B in COOCo (ours) and VISIONS (allegretti_visual_2025), shown across correctness (blue: correct; pink/red: incorrect), semantic-fit conditions (COOCo: high/low; VISIONS: congruent/incongruent), noise areas (target, context, all), and noise levels (0.5, 1.0). Dashed and solid horizontal lines indicate baseline similarity at zero noise for incorrect and correct predictions, respectively.
  • Figure 9: Average relatedness scores grouped by relatedness level, illustrating the semantic similarity between scenes and candidate objects.
  • ...and 29 more figures