Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model

Ruiping Liu, Junwei Zheng, Yufan Chen, Zirui Wang, Kunyu Peng, Kailun Yang, Jiaming Zhang, Marc Pollefeys, Rainer Stiefelhagen

TL;DR

The paper tackles the challenge of holistic situated 3D change understanding in dynamic environments by introducing Situat3DChange, a large-scale real-world dataset that combines perception tasks (short QA and change descriptions) with action tasks (rearrangement instructions) under a perception–action framework. It builds 11K human-annotated interpretations of environmental changes and augments them with egocentric and allocentric perspectives using an LLM, enabling scalable situated data generation. To efficiently compare highly similar 3D scenes, the authors propose SCReasoner, a token-efficient 3D multimodal LLM that fuses two point-cloud streams with minimal overhead, outperforming 2D baselines and improving cross-domain transfer when fine-tuned on Situat3DChange. Extensive experiments demonstrate progress and limitations of MLLMs in dynamic scene understanding, with scaling and cross-domain transfer showing the dataset’s practical value for developing perceptually aligned embodied agents.

Abstract

Physical environments and circumstances are fundamentally dynamic, yet current 3D datasets and evaluation benchmarks tend to concentrate on either dynamic scenarios or dynamic situations in isolation, resulting in incomplete comprehension. To overcome these constraints, we introduce Situat3DChange, an extensive dataset supporting three situation-aware change understanding tasks following the perception-action model: 121K question-answer pairs, 36K change descriptions for perception tasks, and 17K rearrangement instructions for the action task. To construct this large-scale dataset, Situat3DChange leverages 11K human observations of environmental changes to establish shared mental models and shared situational awareness for human-AI collaboration. These observations, enriched with egocentric and allocentric perspectives as well as categorical and coordinate spatial relations, are integrated using an LLM to support understanding of situated changes. To address the challenge of comparing pairs of point clouds from the same scene with minor changes, we propose SCReasoner, an efficient 3D MLLM approach that enables effective point cloud comparison with minimal parameter overhead and no additional tokens required for the language decoder. Comprehensive evaluation on Situat3DChange tasks highlights both the progress and limitations of MLLMs in dynamic scene and situation understanding. Additional experiments on data scaling and cross-domain transfer demonstrate the task-agnostic effectiveness of using Situat3DChange as a training dataset for MLLMs.

Paper Structure

This paper contains 37 sections, 2 equations, 20 figures, 13 tables.

Figures (20)

  • Figure 1: Perception for comprehensive understanding of the dynamic scene with situational awareness, including concise QA and change description.
  • Figure 2: Action for rearrangement instructions to revert changes from the current situation.
  • Figure 3: Different senses of relative spatial directions between robots and humans result in different mental maps.
  • Figure 4: Allocentric and egocentric information used for generating situated QA pairs, change descriptions, and rearrangement instructions.
  • Figure 5: Distinctive features used to refer to objects for generating queries about change descriptions and rearrangement instructions. $\bigtriangleup$, $\bigcirc$, $\square$, and $$ refer to chairs, table, cup, and sofa.
  • ...and 15 more figures