What's in Common? Multimodal Models Hallucinate When Reasoning Across Scenes

Candace Ross, Florian Bordes, Adina Williams, Polina Kirichenko, Mark Ibrahim

TL;DR

This work targets the gap between strong results on perception benchmarks and weak real-world cross-scene reasoning by introducing Common-O Bench, a large evaluation with over $10.5k$ two-image examples (and $12k$ for Common-O Complex) that require identifying the objects common to two scenes. By combining real and synthetic data, multiple camera views, and controlled object counts (up to 7 in Common-O and up to 16 in Common-O Complex), the authors test whether current multimodal models can reason across scenes without hallucinating. They find that even the strongest models (e.g., GPT-4o) reach only $35\%$ accuracy on Common-O Bench and about $1\%$ on Common-O Complex, with high hallucination rates. The results show that single-image perception is tractable for most models, whereas the presence of similar objects makes hallucination more likely in cross-image reasoning. Explicit multi-image training yields larger gains than scale alone, though substantial gaps remain. The publicly released benchmark aims to spur research into robust cross-scene reasoning and to drive development of training paradigms that go beyond standard reward-based approaches to mitigate hallucinations in real-world, multi-image reasoning tasks.

Abstract

Multimodal language models possess a remarkable ability to handle an open-vocabulary's worth of objects. Yet the best models still suffer from hallucinations when reasoning about scenes in the real world, revealing a gap between their seemingly strong performance on existing perception benchmarks that are saturating and their reasoning in the real world. To address this gap, we build a novel benchmark of in-the-wild scenes that we call Common-O. With more than 10.5k examples using exclusively new images not found in web training data to avoid contamination, Common-O goes beyond just perception, inspired by cognitive tests for humans, to probe reasoning across scenes by asking "what's in common?". We evaluate leading multimodal language models, including models specifically trained to perform chain-of-thought reasoning. We find that perceiving objects in single images is tractable for most models, yet reasoning across scenes is very challenging even for the best models, including reasoning models. Despite saturating many leaderboards focusing on perception, the best performing model only achieves 35% on Common-O -- and on Common-O Complex, consisting of more complex scenes, the best model achieves only 1%. Curiously, we find models are more prone to hallucinate when similar objects are present in the scene, suggesting models may be relying on object co-occurrence seen during training. Among the models we evaluated, we found scale can provide modest improvements while models explicitly trained with multi-image inputs show bigger improvements, suggesting scaled multi-image training may offer promise. We make our benchmark publicly available to spur research into the challenge of hallucination when reasoning across scenes.
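
To make the "what's in common?" task concrete, the following is a minimal sketch of how one two-image example could be scored. It is an illustration under stated assumptions, not the authors' evaluation code: the function names, the exact-match criterion for accuracy, and the definition of a hallucination as a predicted object that appears in neither image are all hypothetical choices made here for clarity.

```python
from typing import Iterable, Set, Tuple


def score_pair(objects_a: Set[str], objects_b: Set[str],
               predicted: Set[str]) -> Tuple[bool, Set[str]]:
    """Score one two-image example (hypothetical scoring rule).

    objects_a, objects_b: ground-truth object labels in each image.
    predicted: labels the model claims are common to both images.
    """
    common = objects_a & objects_b
    # Assumption: an answer counts as correct only on an exact match.
    correct = predicted == common
    # Assumption: a hallucination is a predicted object in neither image.
    hallucinated = predicted - (objects_a | objects_b)
    return correct, hallucinated


def aggregate(results: Iterable[Tuple[bool, Set[str]]]) -> Tuple[float, float]:
    """Return (accuracy, hallucination rate) over scored examples."""
    results = list(results)
    n = len(results)
    accuracy = sum(1 for ok, _ in results if ok) / n
    # Fraction of examples with at least one hallucinated object.
    hallucination_rate = sum(1 for _, h in results if h) / n
    return accuracy, hallucination_rate


image_a = {"mug", "laptop", "plant"}
image_b = {"laptop", "plant", "book"}

good = score_pair(image_a, image_b, {"laptop", "plant"})
bad = score_pair(image_a, image_b, {"laptop", "keyboard"})
print(good, bad, aggregate([good, bad]))
```

In this toy pair the common objects are the laptop and the plant. The second answer misses the plant and names a keyboard that is in neither scene, so it is both incorrect and a hallucination under the assumed definitions.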

Paper Structure

This paper contains 33 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Reasoning across scenes is an open challenge for today's best multimodal models. We show the best performance from the Open VLM leaderboard on MMBench and single image evaluations from our benchmark illustrating saturation for perception tasks.
  • Figure 2: Common-O Bench contains real and synthetic images of objects in different orientations and configurations. These are randomly selected examples from the dataset along with the human ground truth labels for the common object(s) between them.
  • Figure 3: Performance for single image object perception in yellow and multi-image reasoning in red for (a) accuracy and (b) hallucination rates. We observe models with higher accuracy tend to also have lower rates of hallucination. We include a table of these results, along with statistical analyses, in \ref{sec:stat-analysis-results}.
  • Figure 4: These are two examples of model failures, with the specific failures shown in red.
  • Figure 5: Performance on Common-O Bench split by whether example image pairs are real or synthetic. The height of each bar represents the total accuracy on Common-O Bench: the green area of the bar represents the contribution of the real image accuracy, and the blue area represents the contribution of the synthetic image accuracy. Models tend to have higher performance on real images (larger green area) than on synthetic ones (smaller blue area). However, the difference in performance between the two subsets decreases as overall accuracy (bar height) decreases, with the DeepSeek-VL2 family, the PerceptionLM family, Llama 3.2 Instruct 11B, and Llava-OneVision 7B showing only a small difference between the two subsets.
  • ...and 2 more figures