InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement

Yude Zou, Junji Gong, Xing Gao, Zixuan Li, Tianxing Chen, Guanjie Zheng

Abstract

Human-object-scene interaction (HOSI) generation has broad applications in embodied AI, simulation, and animation. Unlike human-object interaction (HOI) and human-scene interaction (HSI) generation, HOSI generation requires reasoning over dynamic object-scene changes, yet suffers from limited annotated data. To address these issues, we propose a coarse-to-fine, instruction-conditioned interaction generation framework that is explicitly aligned with the iterative denoising process of a consistency model. In particular, we adopt a dynamic perception strategy that leverages trajectories from the preceding refinement to update the scene context and condition the subsequent refinement at each denoising step of the consistency model, yielding consistent interactions. To further reduce physical artifacts, we introduce bump-aware guidance, which mitigates collisions and penetrations during sampling without requiring fine-grained scene geometry, enabling real-time generation. To overcome data scarcity, we design a hybrid training strategy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets and jointly trains on high-fidelity HSI data, allowing interaction learning while preserving realistic scene awareness. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both HOSI and HOI generation and generalizes well to unseen scenes. Project page: https://yudezou.github.io/InfBaGel-page/
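To make the described sampling loop concrete, the sketch below shows one plausible reading of it in PyTorch: each consistency-model step refines the motion, a differentiable occupancy penalty supplies a bump-aware guidance gradient, and the scene context is re-encoded from the freshly refined trajectory before the next step. This is a minimal sketch under our own assumptions; all names and signatures (`sample_hosi`, `occupancy_penalty`, the `model` and `perception` interfaces) are hypothetical, not the paper's actual API, and the gradient-based form of the guidance is our illustrative choice.

```python
import torch
import torch.nn.functional as F

def occupancy_penalty(joints, scene_occ):
    """Differentiable soft penalty for joints inside occupied voxels.

    joints:    (T, J, 3) xyz positions, normalized to [-1, 1].
    scene_occ: (D, H, W) voxel occupancy grid with values in [0, 1].
    """
    grid = joints.reshape(1, 1, 1, -1, 3)            # query points for grid_sample
    occ = F.grid_sample(scene_occ[None, None], grid,
                        mode="bilinear", align_corners=True)
    return occ.sum()                                 # higher = deeper in collision

@torch.no_grad()
def sample_hosi(model, perception, text_emb, goal, obj_geom, scene_occ,
                seq_len=120, num_joints=22, num_steps=4, guide_scale=0.1):
    """Coarse-to-fine sampling aligned with consistency-model denoising (sketch)."""
    motion = torch.randn(seq_len, num_joints, 3)     # start from noise
    scene_ctx = perception(scene_occ, motion=None)   # initial scene state
    for t in reversed(range(num_steps)):
        # One consistency-model refinement step, fully conditioned.
        motion = model(motion, t, text_emb, goal, obj_geom, scene_ctx)
        # Bump-aware guidance: step against the occupancy gradient.
        with torch.enable_grad():
            m = motion.detach().requires_grad_(True)
            bump = occupancy_penalty(m, scene_occ)
            grad, = torch.autograd.grad(bump, m)
        motion = motion - guide_scale * grad
        # Dynamic perception: re-encode the scene from the trajectory just
        # produced, so the next refinement sees object-scene changes.
        scene_ctx = perception(scene_occ, motion=motion)
    return motion
```

Because the penalty only queries a coarse occupancy grid rather than fine mesh geometry, a guidance step of this kind stays cheap, which is consistent with the real-time claim in the abstract.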

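The hybrid-data idea can be sketched similarly: an HOI clip, which has human and object motion but no surroundings, is promoted to a pseudo-HOSI sample by voxelizing a scene point cloud and attaching the resulting occupancy grid as scene context. The function name, field names, and grid parameters below are illustrative assumptions, not the paper's pipeline.

```python
import torch

def make_pseudo_hosi(hoi_motion, hoi_object, scene_points,
                     grid_size=64, extent=4.0):
    """Inject voxelized scene occupancy into a scene-free HOI clip (sketch).

    hoi_motion:   human motion tensor from an HOI dataset.
    hoi_object:   object trajectory/geometry from the same clip.
    scene_points: (N, 3) scene point cloud in metres, assumed centered on
                  the clip and spanning [-extent/2, extent/2] per axis.
    """
    idx = ((scene_points / extent + 0.5) * (grid_size - 1)).round().long()
    idx = idx.clamp(0, grid_size - 1)
    occ = torch.zeros(grid_size, grid_size, grid_size)
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0       # mark occupied voxels
    return {"motion": hoi_motion, "object": hoi_object, "scene_occ": occ}
```

Training batches would then mix such synthesized samples with real HSI clips, so the model learns object manipulation from HOI data while retaining the scene awareness of high-fidelity HSI data.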

Figures (5)

  • Figure 1: Overview of InfBaGel. Our method operates through an iterative refinement process. (a) The Auto-regressive Motion Model generates arbitrarily long motion sequences conditioned on textual instructions, goals, object geometry, and scene context. (b) The Dynamic Perception Encoder perceives the evolving environment via a temporally aligned scene state updated during iterative sampling. (c) Bump-aware Guidance detects collisions and steers the iterative sampling toward collision-free motions. (d) Hybrid-data training enables robust zero-shot generalization to complex realistic scenes.
  • Figure 2: Qualitative comparison. Top two rows: comparison on human-object interactions in scenes. Bottom row: comparison on a complex multi-stage task involving moving a chair and then sitting on it.
  • Figure 3: Qualitative comparison in the ablation study. Replacing or removing specific modules: (a) a diffusion model instead of the consistency model, (b) static perception instead of dynamic perception, and (c) no bump-aware guidance; each variant produces collisions with the scene.
  • Figure 4: Qualitative results across different scenes, motion types, and object types. The top two rows (a/b) show diverse human-object interactions, including lifting an object overhead and kicking. The last row shows a static scene interaction.
  • Figure 5: Qualitative results in socially interactive scenes, including a store and a physical therapy room. These unseen scenes are taken from LINGO and rendered in default white because they lack textures.