$M^3$-Verse: A "Spot the Difference" Challenge for Large Multimodal Models
Kewei Wei, Bocheng Hu, Jie Cao, Xiaohan Chen, Zhengxi Lu, Wubing Xia, Weili Xu, Jiaao Wu, Junchen He, Mingyu Jia, Ciyun Zhao, Ye Sun, Yizhi Li, Zhonghan Zhao, Jian Zhang, Gaoang Wang
TL;DR
M3-Verse is a two-state, paired-video benchmark that probes state-transition reasoning in large multimodal models (LMMs), addressing the limitations of static-state evaluations. It comprises 270 indoor scenes and 2,932 QA pairs across 50+ subtasks, including hallucination-type questions to stress-test models. The authors evaluate 16 state-of-the-art LMMs alongside a simple baseline, Hierarchical Captioning and Text-based Reasoning (HCTR). The results reveal a substantial human-vs-model gap and show that model size alone does not guarantee better state-change understanding; vision input and architectural design matter more in many cases. HCTR converts multimodal inputs into time-stamped textual narratives, significantly boosting inter-state reasoning and offering a practical path toward more robust dynamic visual understanding.
Abstract
Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancing spatial intelligence. In this paper, we introduce $M^3$-Verse, a Multi-Modal, Multi-State, Multi-Dimensional benchmark, to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, which are categorized into over 50 subtasks that probe 4 core capabilities. We evaluate 16 state-of-the-art LMMs and observe their limitations in tracking state transitions. To address these challenges, we further propose a simple yet effective baseline that achieves significant performance improvements in multi-state perception. $M^3$-Verse thus provides a challenging new testbed to catalyze the development of next-generation models with a more holistic understanding of our dynamic visual world. The construction pipeline is available at https://github.com/Wal-K-aWay/M3-Verse_pipeline and the full benchmark data at https://www.modelscope.cn/datasets/WalKaWay/M3-Verse.
