Understanding Dynamic Scenes in Ego Centric 4D Point Clouds

Junsheng Huang, Shengyu Hao, Bocheng Hu, Hongwei Wang, Gaoang Wang

TL;DR

EgoDynamic4D addresses the understanding of highly dynamic 4D scenes from an egocentric perspective by introducing a QA benchmark with 927K QA pairs, each accompanied by an explicit Chain-of-Thought (CoT) explanation. The dataset unifies RGB-D video, camera poses, and dense 4D annotations across 12 task types. It also proposes an end-to-end framework that compresses long 4D sequences into $d_{vis}$-dimensional tokens through instance-aware encoding, temporal encoding, and adaptive octree down-sampling, so that the sequences fit within an LLM's input. The method outperforms baselines on EgoDynamic4D, which supports the value of multimodal temporal modeling for embodied perception. The work offers a scalable foundation for interpretable egocentric 4D reasoning, with implications for robotics, AR, and autonomous systems.

Abstract

Understanding dynamic 4D scenes from an egocentric perspective, that is, modeling changes in 3D spatial structure over time, is crucial for human-machine interaction, autonomous navigation, and embodied intelligence. While existing egocentric datasets contain dynamic scenes, they lack unified 4D annotations and task-driven evaluation protocols for fine-grained spatio-temporal reasoning, especially for the motion of objects and humans and for their interactions. To address this gap, we introduce EgoDynamic4D, a novel QA benchmark on highly dynamic scenes, comprising RGB-D video, camera poses, globally unique instance masks, and 4D bounding boxes. We construct 927K QA pairs accompanied by explicit Chain-of-Thought (CoT), enabling verifiable, step-by-step spatio-temporal reasoning. We design 12 dynamic QA tasks covering agent motion, human-object interaction, trajectory prediction, relation understanding, and temporal-causal reasoning, with fine-grained, multidimensional metrics. To tackle these tasks, we propose an end-to-end spatio-temporal reasoning framework that unifies dynamic and static scene information, using instance-aware feature encoding, time and camera encoding, and spatially adaptive down-sampling to compress large 4D scenes into token sequences manageable by LLMs. Experiments on EgoDynamic4D show that our method consistently outperforms baselines, validating the effectiveness of multimodal temporal modeling for egocentric dynamic scene understanding.
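
To give a rough feel for the spatially adaptive down-sampling mentioned in the abstract, the sketch below recursively splits a bounding cube into octants and emits one centroid per leaf cell, so densely populated regions are subdivided more finely than sparse ones. This is a generic octree scheme written for illustration, not the paper's implementation: the function name `octree_downsample` and the parameters `max_points` and `max_depth` are assumptions, and the real framework operates on encoded features rather than raw coordinates.

```python
import numpy as np

def octree_downsample(points, max_points=8, max_depth=6):
    """Return one centroid per octree leaf (hypothetical helper, not from the paper)."""
    points = np.asarray(points, dtype=float)
    if len(points) == 0:
        return np.empty((0, 3))

    # Root cell: a cube that encloses every point.
    lo, hi = points.min(axis=0), points.max(axis=0)
    root_center = (lo + hi) / 2.0
    root_half = max(float((hi - lo).max()) / 2.0, 1e-9)
    leaves = []

    def recurse(pts, center, half, depth):
        if len(pts) == 0:
            return
        # Stop splitting when the cell is sparse enough or the depth limit is hit.
        if len(pts) <= max_points or depth >= max_depth:
            leaves.append(pts.mean(axis=0))
            return
        # Octant index in 0..7: bit 0 for x, bit 1 for y, bit 2 for z.
        octant = ((pts >= center).astype(int) * np.array([1, 2, 4])).sum(axis=1)
        for k in range(8):
            sign = np.array([k & 1, (k >> 1) & 1, (k >> 2) & 1]) * 2 - 1
            recurse(pts[octant == k], center + sign * half / 2.0, half / 2.0, depth + 1)

    recurse(points, root_center, root_half, 0)
    return np.array(leaves)
```

Each returned centroid would stand in for one token position; lowering `max_points` or raising `max_depth` keeps more detail at the cost of a longer sequence.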

Paper Structure

This paper contains 30 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: We introduce a novel QA benchmark EgoDynamic4D and an end-to-end spatio-temporal reasoning framework.
  • Figure 3: QA generation pipeline. Given RGB-D sequences with aligned 3D bounding boxes and poses, we extract spatial-temporal properties, apply template-based CoT reasoning, and refine questions via LLMs and human validation.
  • Figure 4: Time span and motion distance distribution.
  • Figure 5: The end-to-end framework, which encodes dynamic 4D scenes from egocentric videos.
  • Figure 6: Qualitative examples illustrating the complexity of spatial-temporal reasoning in EgoDynamic4D.
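
The QA generation pipeline in the Figure 3 caption (extract spatio-temporal properties from annotated boxes, then fill CoT templates) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the helper names `motion_distance` and `make_qa`, the wording of the template, and the use of metres and seconds as units are not taken from the paper, and the LLM refinement and human validation stages are omitted.

```python
import numpy as np

def motion_distance(centers):
    """Path length of a trajectory given per-frame 3D box centers of shape (T, 3)."""
    centers = np.asarray(centers, dtype=float)
    if len(centers) < 2:
        return 0.0
    # Sum of Euclidean distances between consecutive centers.
    return float(np.linalg.norm(np.diff(centers, axis=0), axis=1).sum())

def make_qa(name, timestamps, centers):
    """Fill a hypothetical question template and a step-by-step CoT string."""
    dist = motion_distance(centers)
    span = float(timestamps[-1] - timestamps[0])
    question = (
        f"How far does the {name} move between "
        f"t={timestamps[0]:.1f}s and t={timestamps[-1]:.1f}s?"
    )
    cot = (
        f"Step 1: read the {name}'s box center in each of the {len(centers)} frames. "
        f"Step 2: sum the distances between consecutive centers. "
        f"Step 3: the total is {dist:.2f} m over {span:.1f} s."
    )
    return {"question": question, "cot": cot, "answer": f"About {dist:.2f} m."}

qa = make_qa("cup", [0.0, 1.0, 2.0], [[0, 0, 0], [3, 4, 0], [3, 4, 12]])
print(qa["question"])
print(qa["cot"])
print(qa["answer"])
```

Because the answer and every CoT step are computed from the annotations, a template like this keeps the reasoning checkable against the underlying boxes.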