
Multimodal Language Models Cannot Spot Spatial Inconsistencies

Om Khangaonkar, Hadi J. Rad, Hamed Pirsiavash

Abstract

Spatial consistency is a fundamental property of the visual world and a key requirement for models that aim to understand physical reality. Despite recent advances, multimodal large language models (MLLMs) often struggle to reason about 3D geometry across multiple views. Rather than asking models to describe scene attributes, we introduce a more challenging task: given two views of the same scene, identify the object that violates 3D motion consistency. We propose a simple and scalable method for generating realistic, spatially inconsistent image pairs from multi-view scenes, enabling systematic evaluation of this capability. Our results show that state-of-the-art MLLMs significantly underperform human observers and exhibit substantial variability across different scene attributes, revealing a fragile and incomplete understanding of 3D structure. We hope our findings underscore the need for approaches that develop a more deeply grounded understanding of the physical world.


Paper Structure

This paper contains 21 sections, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Synthesizing spatially inconsistent image pairs from multi-view data. Given three views $(V_1,V_2,V_3)$ of the same static scene, we (1) select an object $O$ visible in all views, (2) erase $O$ in view $V_2$ and inpaint to obtain a clean background, and (3) paste the instance of $O$ from view $V_3$ back into $V_2$ at its original location. Because $V_3$ is captured from a different camera pose, the pasted object has an appearance that is incompatible with the 3D geometry implied by $V_1 \rightarrow V_2$, while all other objects remain consistent. This yields realistic image pairs with controlled spatial violations and no manual annotation. We label objects in $V_1$ for our forced-choice evaluation.
  • Figure 2: Example spatial inconsistencies. AI labels are ordered as GPT-5 (LR), Gemini 2.5 Pro (MR) and Qwen3-VL 8B Instruct. Zoom in to see labels in detail.
  • Figure 3: Model accuracy varies across scene attributes, but much less across the number of labels per pair. Left: We report accuracy for identifying the single spatially inconsistent object, stratified by inconsistent object depth (close/medium/far), average pair brightness (dark/medium/bright), and the augmented object's physical plausibility. While humans remain comparatively robust across conditions, models can show substantial sensitivity to changes in the pair's scene composition. Right: We partition the dataset based on the number of labels per pair and report accuracy. Interestingly, this has only a minor effect on humans and little effect on models. Dashed line represents random chance.
  • Figure 4: Accuracy varies greatly across the inconsistent object or pair scene categories. While humans are relatively consistent across all settings, the models show large variance. This suggests that their 3D understanding is brittle across our diverse visual world. Dashed line represents random chance.
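The three-step compositing pipeline of Figure 1 (select, erase-and-inpaint, paste) can be sketched in a few lines. The sketch below is a toy illustration, not the authors' implementation: the diffusion inpainter is replaced by a mean-color fill, segmentation masks are assumed to be given, and the paste step assumes the object's bounding boxes in $V_2$ and $V_3$ have identical sizes (a real pipeline would align and resample the crop). The function name `make_inconsistent_pair` and the helper `_bbox` are hypothetical.

```python
import numpy as np

def _bbox(mask):
    """Tight bounding box (y0, x0, y1, x1) of a boolean mask."""
    ys, xs = np.where(mask)
    return ys.min(), xs.min(), ys.max() + 1, xs.max() + 1

def make_inconsistent_pair(v1, v2, v3, mask2, mask3):
    """Toy sketch of steps (2)-(3) of the Figure 1 pipeline.

    v1, v2, v3: HxWx3 float arrays, three views of the same static scene.
    mask2, mask3: boolean HxW masks of the chosen object O in V2 and V3.
    Returns the evaluation pair (V1, edited V2).
    """
    edited = v2.astype(float).copy()

    # Step (2), crude stand-in for "erase + inpaint": fill the erased
    # object with the mean background color. A real pipeline would use
    # a learned inpainting model to synthesize a clean background.
    bg_color = edited[~mask2].mean(axis=0)
    edited[mask2] = bg_color

    # Step (3): paste O's appearance from V3 back at its original V2
    # location, so it no longer matches the V1 -> V2 camera motion.
    y0, x0, y1, x1 = _bbox(mask2)   # where O was in V2
    Y0, X0, Y1, X1 = _bbox(mask3)   # where O appears in V3
    crop = v3[Y0:Y1, X0:X1]
    crop_mask = mask3[Y0:Y1, X0:X1]
    region = edited[y0:y1, x0:x1]   # view into `edited`
    region[crop_mask] = crop[crop_mask]

    return v1, edited
```

Because all other objects in the edited $V_2$ are untouched, only the pasted object violates the 3D motion implied by the pair, which is what makes the forced-choice evaluation well posed.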