What's in Common? Multimodal Models Hallucinate When Reasoning Across Scenes
Candace Ross, Florian Bordes, Adina Williams, Polina Kirichenko, Mark Ibrahim
TL;DR
This work targets the gap between strong results on perception benchmarks and real-world cross-scene reasoning by introducing Common-O Bench, a large, open evaluation with over $10.5k$ two-image examples (and $12k$ for Common-O Complex) that require identifying the objects common to two scenes. By combining real and synthetic data, multiple camera views, and controlled object counts (up to 7 in Common-O and up to 16 in Common-O Complex), the authors test whether current multimodal models can reason across scenes without hallucinating. Even the strongest models (e.g., GPT-4o) reach only $35\%$ accuracy on Common-O Bench and roughly $1\%$ on Common-O Complex, with high hallucination rates. The results indicate that the presence of similar objects, rather than perception alone, degrades cross-image reasoning, and that explicit multi-image training yields larger improvements than scale, which helps only modestly; substantial gaps remain in both cases. The publicly released benchmark aims to spur research into robust cross-scene reasoning and to drive development of training paradigms that go beyond standard reward-based approaches to mitigate hallucinations in real-world, multi-image reasoning tasks.
Abstract
Multimodal language models possess a remarkable ability to handle an open-vocabulary's worth of objects. Yet the best models still suffer from hallucinations when reasoning about scenes in the real world, revealing a gap between their seemingly strong performance on existing, saturating perception benchmarks and their reasoning in the real world. To address this gap, we build a novel benchmark of in-the-wild scenes that we call Common-O. With more than 10.5k examples using exclusively new images not found in web training data to avoid contamination, Common-O goes beyond just perception, inspired by cognitive tests for humans, to probe reasoning across scenes by asking "what's in common?". We evaluate leading multimodal language models, including models specifically trained to perform chain-of-thought reasoning. We find that perceiving objects in single images is tractable for most models, yet reasoning across scenes is very challenging even for the best models, including reasoning models. Although these models saturate many perception-focused leaderboards, the best-performing model achieves only 35% on Common-O, and on Common-O Complex, which consists of more complex scenes, the best model achieves only 1%. Curiously, we find models are more prone to hallucinate when similar objects are present in the scene, suggesting models may be relying on object co-occurrence seen during training. Among the models we evaluated, we find that scale can provide modest improvements while models explicitly trained with multi-image inputs show bigger improvements, suggesting scaled multi-image training may offer promise. We make our benchmark publicly available to spur research into the challenge of hallucination when reasoning across scenes.
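To make the "what's in common?" task concrete, the following is a minimal sketch of how one two-image example might be scored. It is not the benchmark's official scoring code: the `Example` class, the `score` function, exact-set-match accuracy, and the definition of a hallucination as a predicted object that appears in neither image are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Example:
    """One hypothetical two-image example (names are illustrative only)."""
    objects_a: set[str]   # objects present in the first scene
    objects_b: set[str]   # objects present in the second scene
    predicted: set[str]   # objects the model claims are in common


def score(examples: list[Example]) -> tuple[float, float]:
    """Return (exact-match accuracy, hallucination rate) over the examples."""
    if not examples:
        raise ValueError("need at least one example")
    correct = 0
    hallucinated = 0
    for ex in examples:
        # Ground truth: objects that appear in both scenes.
        truth = ex.objects_a & ex.objects_b
        if ex.predicted == truth:
            correct += 1
        # Assumed definition: the answer hallucinates if it names any
        # object that is present in neither scene.
        if ex.predicted - (ex.objects_a | ex.objects_b):
            hallucinated += 1
    n = len(examples)
    return correct / n, hallucinated / n


examples = [
    # Correct: both common objects named, nothing extra.
    Example({"mug", "laptop", "pen"}, {"mug", "pen", "plant"}, {"mug", "pen"}),
    # Wrong and hallucinated: "keyboard" is in neither scene.
    Example({"mug", "laptop"}, {"mug", "plant"}, {"mug", "keyboard"}),
    # Wrong but not hallucinated: "fork" exists, just not in both scenes.
    Example({"fork", "spoon"}, {"spoon", "knife"}, {"spoon", "fork"}),
    # Correct: nothing in common, and the model says so.
    Example({"chair"}, {"lamp"}, set()),
]

accuracy, hallucination_rate = score(examples)
print(f"accuracy={accuracy:.2f} hallucination_rate={hallucination_rate:.2f}")
```

Under these assumed definitions, the third example shows why accuracy and hallucination rate are tracked separately: a model can be wrong about what is shared without inventing an object that is absent from both scenes.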
