$M^3-Verse$: A "Spot the Difference" Challenge for Large Multimodal Models

Kewei Wei, Bocheng Hu, Jie Cao, Xiaohan Chen, Zhengxi Lu, Wubing Xia, Weili Xu, Jiaao Wu, Junchen He, Mingyu Jia, Ciyun Zhao, Ye Sun, Yizhi Li, Zhonghan Zhao, Jian Zhang, Gaoang Wang

TL;DR

M3-Verse introduces a two-state, paired-video benchmark that probes state-transition reasoning in large multimodal models (LMMs), addressing the limitations of static-state evaluations. It comprises 270 indoor scenes and 2,932 QA pairs across 50+ subtasks, including hallucination-type questions to stress-test models, and is used to evaluate 16 state-of-the-art LMMs. The results reveal a substantial human-vs-model gap and show that model size alone does not guarantee better state-change understanding; vision input and architectural design matter more in many cases. The authors further propose Hierarchical Captioning and Text-based Reasoning (HCTR), a simple baseline that converts multimodal inputs into time-stamped textual narratives, significantly boosting inter-state reasoning and offering a practical path toward more robust dynamic visual understanding.
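The idea of turning paired-video input into a time-stamped textual narrative can be illustrated with a minimal sketch. This is not the authors' HCTR implementation: the `Caption` record, the `build_narrative` and `build_prompt` helpers, the two-level layout (state, then clip), and the prompt wording are all hypothetical, and the captions are assumed to have been produced already by some captioning model.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    """Hypothetical record for one captioned clip (not from the paper)."""
    state: str      # "before" or "after" the state change
    start_s: float  # clip start time in seconds
    end_s: float    # clip end time in seconds
    text: str       # caption text, assumed to come from a captioning model


def build_narrative(captions):
    """Group captions by state, order them by time, and render one line per clip."""
    state_order = {"before": 0, "after": 1}
    ordered = sorted(captions, key=lambda c: (state_order[c.state], c.start_s))
    lines = []
    current_state = None
    for cap in ordered:
        if cap.state != current_state:
            # Upper level of the assumed hierarchy: one header per state.
            current_state = cap.state
            lines.append(f"== State: {cap.state} ==")
        # Lower level: one time-stamped line per clip.
        lines.append(f"[{cap.start_s:.1f}s-{cap.end_s:.1f}s] {cap.text}")
    return "\n".join(lines)


def build_prompt(captions, question):
    """Combine the narrative and a question into a text-only prompt."""
    return f"{build_narrative(captions)}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    demo = [
        Caption("after", 0.0, 4.0, "The mug is on the shelf."),
        Caption("before", 0.0, 4.0, "A mug is on the table."),
        Caption("before", 4.0, 8.0, "A sofa faces the TV."),
    ]
    print(build_prompt(demo, "Where did the mug move?"))
```

The resulting prompt can be passed to any text-only language model, so the inter-state comparison is carried out over text rather than over raw frames.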

Abstract

Modern Large Multimodal Models (LMMs) have demonstrated extraordinary ability in static image and single-state spatial-temporal understanding. However, their capacity to comprehend the dynamic changes of objects within a shared spatial context between two distinct video observations remains largely unexplored. This ability to reason about transformations within a consistent environment is particularly crucial for advancements in the field of spatial intelligence. In this paper, we introduce $M^3-Verse$, a Multi-Modal, Multi-State, Multi-Dimensional benchmark, to formally evaluate this capability. It is built upon paired videos that provide multi-perspective observations of an indoor scene before and after a state change. The benchmark contains a total of 270 scenes and 2,932 questions, which are categorized into over 50 subtasks that probe 4 core capabilities. We evaluate 16 state-of-the-art LMMs and observe their limitations in tracking state transitions. To address these challenges, we further propose a simple yet effective baseline that achieves significant performance improvements in multi-state perception. $M^3-Verse$ thus provides a challenging new testbed to catalyze the development of next-generation models with a more holistic understanding of our dynamic visual world. The construction pipeline is available at https://github.com/Wal-K-aWay/M3-Verse_pipeline and the full benchmark data at https://www.modelscope.cn/datasets/WalKaWay/M3-Verse.

Paper Structure

This paper contains 41 sections, 22 figures, 8 tables.

Figures (22)

  • Figure 1: Limitation of current LMM benchmarks. Existing benchmarks concentrate on evaluating a model's performance on an isolated state, neglecting to test its ability to understand changes across different states. Furthermore, the lack of relevant data and benchmarks has impeded the development of model capabilities in this area. This work is designed to fill that gap.
  • Figure 2: Overview of M3-Verse. Intra-State questions can be answered using information from a single state; Inter-State questions require information from both the 'before' and 'after' states. These categories are further structured to evaluate four key capabilities: Spatial Understanding, Temporal Understanding, Attribute Recognition, and Reasoning.
  • Figure 3: M3-Verse Statistics. (a) Questions in M3-Verse are classified into five primary categories by task, which are further subdivided into more than 50 types of sub-tasks. (b) Questions in M3-Verse fall into four main categories, each including intra-state and inter-state tasks. A single question is not designed to test just one capability, but rather to evaluate multiple abilities in a compound manner.
  • Figure 4: An analysis of the impact of the number of sampled video frames on overall score. The results are shown for two models: (a) Qwen3-VL-4B-Instruct and (b) Video-XL-2.
  • Figure 5: Performance comparison of various models with and without Hierarchical Captioning and Text-based Reasoning, measured by (a) Overall Score and (b) Inter-state Average.
  • ...and 17 more figures