
Contextual inference from single objects in Vision-Language models

Martina G. Vilas, Timothy Schaumlöffel, Gemma Roig

Abstract

How much scene context a single object carries is a well-studied question in human scene perception, yet how this capacity is organized in vision-language models (VLMs) remains poorly understood, with direct implications for the robustness of these models. We investigate this question through a systematic behavioral and mechanistic analysis of contextual inference from single objects. Presenting VLMs with single objects on masked backgrounds, we probe their ability to infer both fine-grained scene category and coarse superordinate context (indoor vs. outdoor). We find that single objects support above-chance inference at both levels, with performance modulated by the same object properties that predict human scene categorization. Object identity, scene, and superordinate predictions are partially dissociable: accurate inference at one level neither requires nor guarantees accurate inference at the others, and the degree of coupling differs markedly across models. Mechanistically, object representations that remain stable when background context is removed are more predictive of successful contextual inference. Scene and superordinate schemas are grounded in fundamentally different ways: scene identity is encoded in image tokens throughout the network, while superordinate information emerges only late or not at all. Together, these results reveal that the organization of contextual inference in VLMs is more complex than accuracy alone suggests, with behavioral and mechanistic signatures.

Figures (7)

  • Figure 1: Experimental design. Each image condition (full scene or object-only) is paired with three prompt types: (A) scene category, (B) superordinate category, and (C) object identity. We analyze the intermediate image token representations corresponding to the objects (shown in black).
  • Figure 2: Classification accuracy across image conditions and tasks. Bars show accuracy under the full-scene (original image) and object-only (single foreground object on a masked background) conditions. For scene classification, solid bars indicate normal accuracy (exact label match) and hatched bars indicate relaxed accuracy (any semantically valid label accepted). Error bars denote the standard error of the mean. (A background-masking sketch follows this list.)
  • Figure 3: Log-odds coefficients from a multivariate logistic regression predicting classification accuracy in the object-only condition, as a function of object size, frequency (Freq.), specificity (Spec.), and object type (anchor vs. local). Error bars denote 95% confidence intervals. Significance levels: $^{**}p < .01$, $^{***}p < .001$. (A regression sketch follows this list.)
  • Figure 4: Representational stability of object-patch tokens across transformer layers. Left: Mean cosine similarity between hidden-state activations of object patches under the full-scene and object-only conditions, averaged across all images, for LLaVA and InternVL. Higher values indicate that object representations are less modulated by the presence of background context. Center and right: Difference in mean cosine similarity between correctly and incorrectly classified trials ($\Delta$ cosine similarity) at each layer. Values above zero indicate that correctly classified trials have more stable object representations. Shaded regions denote 95% confidence intervals. (A cosine-similarity sketch follows this list.)
  • Figure 5: ROC-AUC between top-3 patch logit strength and binary classification accuracy at each transformer layer. Values above the dashed line indicate that logit strength in image patch tokens is predictive of classification accuracy at that layer. Markers indicate statistically significant layers. (An ROC-AUC sketch follows this list.)
  • ...and 2 more figures
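The object-only condition referenced in Figures 1 and 2 amounts to keeping a single foreground object and replacing everything else with a uniform fill. Below is a minimal sketch of that operation, assuming a NumPy image array and a boolean segmentation mask; the function name, fill color, and toy data are hypothetical, not the paper's code.

```python
# Minimal sketch (not the paper's code): build the object-only condition by
# keeping the foreground object's pixels and masking out the background.
import numpy as np

def mask_background(image, object_mask, fill=(127, 127, 127)):
    """image: HxWx3 uint8 array; object_mask: HxW boolean array, True on the object."""
    out = np.full_like(image, fill)        # uniform gray background everywhere
    out[object_mask] = image[object_mask]  # paste the object's pixels back in
    return out

# Toy usage with a synthetic image and a rectangular object mask.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, (224, 224, 3), dtype=np.uint8)
obj = np.zeros((224, 224), dtype=bool)
obj[80:160, 90:180] = True
object_only = mask_background(img, obj)
```

The sketch is agnostic to where `object_mask` comes from; any segmentation source could supply it.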
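Figure 3 reports log-odds coefficients from a logistic regression of per-trial accuracy on object properties. The sketch below shows one way to fit such a model with statsmodels; the column names and simulated data are assumptions, and only the general model form (binary accuracy regressed on size, frequency, specificity, and anchor status) is taken from the caption.

```python
# Sketch of a Figure-3-style regression (simulated data; hypothetical columns).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "correct": rng.integers(0, 2, n),   # 1 if the trial was classified correctly
    "size": rng.uniform(0.01, 0.6, n),  # object area relative to the image
    "freq": rng.uniform(0.0, 1.0, n),   # object frequency
    "spec": rng.uniform(0.0, 1.0, n),   # object specificity
    "anchor": rng.integers(0, 2, n),    # 1 = anchor object, 0 = local object
})

# Standardize continuous predictors so the log-odds coefficients are comparable.
for col in ["size", "freq", "spec"]:
    df[col] = (df[col] - df[col].mean()) / df[col].std()

model = smf.logit("correct ~ size + freq + spec + anchor", data=df).fit()
print(model.params)          # log-odds coefficients, as plotted in Figure 3
print(model.conf_int(0.05))  # 95% confidence intervals for the error bars
```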
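Figure 4's stability measure is a per-layer cosine similarity between the hidden states of the same object patches under the full-scene and object-only conditions. A minimal sketch, assuming the per-layer hidden states have already been extracted (the variable names and shapes here are assumptions):

```python
# Sketch of the layer-wise representational-stability measure behind Figure 4.
import torch
import torch.nn.functional as F

def layerwise_object_stability(full_states, object_states, object_mask):
    """Mean cosine similarity of object-patch tokens per layer.

    full_states, object_states: lists of [num_patches, dim] tensors, one per layer.
    object_mask: boolean [num_patches] tensor marking the object's patches.
    """
    sims = []
    for h_full, h_obj in zip(full_states, object_states):
        # Compare only the patches covering the foreground object.
        cos = F.cosine_similarity(h_full[object_mask], h_obj[object_mask], dim=-1)
        sims.append(cos.mean().item())
    return sims  # one value per layer; higher = less modulation by context

# Toy usage with random activations (3 layers, 16 patches, 8 dimensions).
layers_full = [torch.randn(16, 8) for _ in range(3)]
layers_obj = [h + 0.1 * torch.randn_like(h) for h in layers_full]
mask = torch.zeros(16, dtype=torch.bool)
mask[5:9] = True
print(layerwise_object_stability(layers_full, layers_obj, mask))
```

The $\Delta$ cosine similarity in the center and right panels of Figure 4 is then this quantity computed separately over correctly and incorrectly classified trials and subtracted per layer.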
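Figure 5 scores how well a per-trial, per-layer "top-3 patch logit strength" discriminates correct from incorrect trials, using ROC-AUC. A minimal sketch with simulated scores (how the strength values are read out of the model is not reproduced here):

```python
# Sketch of the layer-wise ROC-AUC analysis in Figure 5 (simulated scores).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_layers, n_trials = 5, 200
correct = rng.integers(0, 2, n_trials)  # binary classification accuracy per trial

aucs = []
for layer in range(n_layers):
    # Hypothetical per-trial score standing in for the mean logit strength of
    # the top-3 patches; made weakly predictive in deeper layers for illustration.
    strength = rng.normal(0.0, 1.0, n_trials) + 0.1 * layer * correct
    aucs.append(roc_auc_score(correct, strength))

print(aucs)  # values above 0.5 (the dashed line) mean strength predicts accuracy
```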