Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model
Ruiping Liu, Junwei Zheng, Yufan Chen, Zirui Wang, Kunyu Peng, Kailun Yang, Jiaming Zhang, Marc Pollefeys, Rainer Stiefelhagen
TL;DR
The paper tackles holistic situated 3D change understanding in dynamic environments by introducing Situat3DChange, a large-scale real-world dataset that combines perception tasks (short QA and change descriptions) with an action task (rearrangement instructions) under a perception-action framework. It builds on 11K human observations of environmental changes and uses an LLM to integrate them with egocentric and allocentric perspectives, enabling scalable situated data generation. To efficiently compare highly similar 3D scenes, the authors propose SCReasoner, a token-efficient 3D multimodal LLM that fuses two point-cloud streams with minimal parameter overhead, outperforming 2D baselines and improving cross-domain transfer when fine-tuned on Situat3DChange. Extensive experiments show both the progress and the limitations of MLLMs in dynamic scene understanding, and the data scaling and cross-domain transfer results indicate the dataset's practical value for developing perceptually aligned embodied agents.
Abstract
Physical environments and circumstances are fundamentally dynamic, yet current 3D datasets and evaluation benchmarks tend to concentrate on either dynamic scenarios or dynamic situations in isolation, resulting in incomplete comprehension. To overcome these constraints, we introduce Situat3DChange, an extensive dataset supporting three situation-aware change understanding tasks following the perception-action model: 121K question-answer pairs, 36K change descriptions for perception tasks, and 17K rearrangement instructions for the action task. To construct this large-scale dataset, Situat3DChange leverages 11K human observations of environmental changes to establish shared mental models and shared situational awareness for human-AI collaboration. These observations, enriched with egocentric and allocentric perspectives as well as categorical and coordinate spatial relations, are integrated using an LLM to support understanding of situated changes. To address the challenge of comparing pairs of point clouds from the same scene with minor changes, we propose SCReasoner, an efficient 3D MLLM approach that enables effective point cloud comparison with minimal parameter overhead and no additional tokens required for the language decoder. Comprehensive evaluation on Situat3DChange tasks highlights both the progress and limitations of MLLMs in dynamic scene and situation understanding. Additional experiments on data scaling and cross-domain transfer demonstrate the task-agnostic effectiveness of using Situat3DChange as a training dataset for MLLMs.
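To make the "no additional tokens" property concrete, the following is a minimal sketch, not the authors' SCReasoner implementation. It assumes the two point clouds have already been encoded into aligned token arrays of identical shape, and it uses a hypothetical residual fusion (a single learned matrix `w_delta` applied to the token-wise difference). The function name `fuse_scene_tokens` and all parameters are illustrative assumptions; the abstract does not specify the actual fusion mechanism.

```python
import numpy as np

def fuse_scene_tokens(pre_tokens, post_tokens, w_delta):
    """Fuse two aligned token streams into one stream of the same length.

    Hypothetical fusion for illustration only: the token-wise difference is
    projected by w_delta and added to the post-change tokens, so the output
    keeps shape (N, D) instead of growing to (2N, D).
    """
    if pre_tokens.shape != post_tokens.shape:
        raise ValueError("token streams must have identical shape (N, D)")
    delta = post_tokens - pre_tokens        # zero wherever nothing changed
    return post_tokens + delta @ w_delta    # the only extra parameters: D x D

rng = np.random.default_rng(0)
n_tokens, dim = 8, 16                       # toy sizes, chosen arbitrarily
pre = rng.normal(size=(n_tokens, dim))
post = pre.copy()
post[3] += 1.0                              # simulate a change in one token
w_delta = rng.normal(scale=0.1, size=(dim, dim))

fused = fuse_scene_tokens(pre, post, w_delta)
concatenated = np.concatenate([pre, post], axis=0)  # naive alternative

print(fused.shape, concatenated.shape)      # (8, 16) (16, 16)
```

In this toy setup the fused stream has the same length as a single scene, whereas naive concatenation doubles the number of tokens passed to the language decoder. Unchanged tokens pass through untouched because their difference is zero, which reflects the intuition behind comparing two nearly identical scans.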
