Scaling Up Dynamic Human-Scene Interaction Modeling

Nan Jiang, Zhiyuan Zhang, Hongjie Li, Xiaoxuan Ma, Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, Siyuan Huang

TL;DR

The paper tackles the scarcity of high-quality 3D human–scene interaction (HSI) data and the challenge of long-horizon, controllable motion synthesis. It introduces TRUMANS, the largest motion-captured HSI dataset to date, featuring over 15 hours of interactions across 100 indoor scenes with whole-body motion and object dynamics, plus photorealistic RGBD renderings and per-frame contact annotations. It then proposes a diffusion-based autoregressive model conditioned on 3D scene context and frame-wise action labels, leveraging a Local Scene Perceiver and frame-wise action encoding to generate arbitrary-length motions in real time. Through extensive static/dynamic evaluations and zero-shot transfer to unseen scenes, the method achieves high realism, strong scene adherence, and competitive or superior performance to state-of-the-art baselines, while also improving image-based perception tasks when paired with real data. The work provides a scalable, controllable, and transferable framework for HSI modeling with broad implications for robotics, simulation, and perception research.

Abstract

Confronting the challenges of data scarcity and advanced motion synthesis in human-scene interaction modeling, we introduce the TRUMANS dataset alongside a novel HSI motion synthesis method. TRUMANS stands as the most comprehensive motion-captured HSI dataset currently available, encompassing over 15 hours of human interactions across 100 indoor scenes. It intricately captures whole-body human motions and part-level object dynamics, focusing on the realism of contact. This dataset is further scaled up by transforming physical environments into exact virtual models and applying extensive augmentations to appearance and motion for both humans and objects while maintaining interaction fidelity. Utilizing TRUMANS, we devise a diffusion-based autoregressive model that efficiently generates HSI sequences of any length, taking into account both scene context and intended actions. In experiments, our approach shows remarkable zero-shot generalizability on a range of 3D scene datasets (e.g., PROX, Replica, ScanNet, ScanNet++), producing motions that closely mimic original motion-captured sequences, as confirmed by quantitative experiments and human studies.

Paper Structure

This paper contains 58 sections, 12 equations, 8 figures, and 8 tables.

Figures (8)

  • Figure 1: Overview of the TRUMANS dataset and our HSI framework. We introduce the most extensive motion-captured HSI dataset, featuring diverse HSIs precisely captured in 100 scene configurations. Benefiting from TRUMANS, we propose a novel method for generating HSIs of arbitrary length, surpassing all baselines and exhibiting superb zero-shot generalizability.
  • Figure 2: Data augmentation for motion generation. This example highlights how human motion is adjusted to accommodate variations in object sizes. Specifically, the chair's height is increased, and the bed's height is decreased, each by $15$cm. Our augmentation method proficiently modifies human motion to maintain consistent interactions despite these changes in object dimensions.
  • Figure 3: Model architecture. (a) Our model employs an autoregressive diffusion sampling approach to generate arbitrarily long motion sequences. (b) Within each episode, we synthesize motion using DDPM integrated with a transformer architecture, taking the human joint locations as input. (c)(d) Action and scene conditions are encoded and forwarded to the first token, guiding the motion synthesis process.
  • Figure 4: Visualization of motion generation. Leveraging local scene context and action instructions as conditions, our method demonstrates its proficiency in (a) initiating motion given the surrounding environment, (b) dynamically interacting with objects, (c) avoiding collisions during motion progression, and (d) robustly synthesizing long-term motion. The depicted scenes are selected from PROX, Replica, and FRONT3D-test datasets, none of which were included in the training phase. For qualitative results, please refer to the Supplementary Video.
  • Figure A1: Additional qualitative results of 3D contact estimation.
  • ...and 3 more figures
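
The autoregressive sampling described in the Figure 3 caption, where fixed-length episodes are chained into an arbitrarily long sequence, can be illustrated with a minimal sketch. This is not the authors' implementation: `denoise_episode` is a hypothetical toy stand-in for the learned DDPM denoiser, and the episode length, overlap, and joint count are assumed values chosen only for illustration. Scene and action conditioning are omitted.

```python
import numpy as np


def denoise_episode(noisy, prefix, steps=10):
    """Hypothetical stand-in for a learned denoiser (NOT the paper's model).

    Pulls the noisy sample toward a constant continuation of the last
    prefix frame, and re-imposes the known prefix frames after every step
    so that the new episode stays consistent with the previous one.
    """
    x = noisy.copy()
    k = prefix.shape[0]
    target = np.broadcast_to(prefix[-1], x.shape)
    for t in range(steps, 0, -1):
        x = x + (target - x) / t  # toy update; reaches the target at t == 1
        x[:k] = prefix            # keep the overlapping frames fixed
    return x


def generate(first_frames, total_frames, episode_len=16, overlap=2, seed=0):
    """Chain episodes autoregressively until total_frames are produced.

    first_frames: array of shape (overlap, joints, 3) with joint locations.
    episode_len and overlap are assumed values, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    motion = first_frames.copy()
    while motion.shape[0] < total_frames:
        prefix = motion[-overlap:]  # last frames of the sequence so far
        noisy = rng.standard_normal((episode_len,) + prefix.shape[1:])
        episode = denoise_episode(noisy, prefix)
        # Append only the newly generated frames, skipping the overlap.
        motion = np.concatenate([motion, episode[overlap:]], axis=0)
    return motion[:total_frames]


start = np.zeros((2, 24, 3))  # 24 joints is an assumed skeleton size
motion = generate(start, total_frames=50)
print(motion.shape)
```

The point of the sketch is the outer loop: each episode is sampled from noise while its first frames are pinned to the tail of the previous episode, so the sequence length is bounded only by how many episodes are chained.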