PhysHSI: Towards a Real-World Generalizable and Natural Humanoid-Scene Interaction System
Huayi Wang, Wentao Zhang, Runyi Yu, Tao Huang, Junli Ren, Feiyu Jia, Zirui Wang, Xiaojie Niu, Xiao Chen, Jiahe Chen, Qifeng Chen, Jingbo Wang, Jiangmiao Pang
TL;DR
PhysHSI tackles real-world humanoid-scene interaction (HSI) by combining simulation-trained, adversarial motion prior (AMP)-based policies with a coarse-to-fine perception system for robust object localization. Training uses retargeted motion-capture (MoCap) data, an adversarial discriminator that steers the policy toward realistic motion, and hybrid reference-state initialization for efficient exploration in long-horizon tasks. In real-world experiments, the policies transfer zero-shot to carrying, sitting, lying, and standing-up tasks, generalize well, and move naturally, including in outdoor deployment; domain randomization and robust perception support this transfer. The work offers a practical path toward generalizable, lifelike humanoid interaction, along with evaluation protocols and insights for future real-world HSI research.
Abstract
Deploying humanoid robots to interact with real-world environments (for example, carrying objects or sitting on chairs) requires generalizable, lifelike motions and robust scene perception. Although prior approaches have advanced each capability individually, combining them in a unified system remains an open challenge. In this work, we present a physical-world humanoid-scene interaction system, PhysHSI, that enables humanoids to autonomously perform diverse interaction tasks while maintaining natural and lifelike behaviors. PhysHSI comprises a simulation training pipeline and a real-world deployment system. In simulation, we adopt adversarial motion prior-based policy learning to imitate natural humanoid-scene interaction data across diverse scenarios, achieving both generalization and lifelike behaviors. For real-world deployment, we introduce a coarse-to-fine object localization module that combines LiDAR and camera inputs to provide continuous and robust scene perception. We validate PhysHSI on four representative interactive tasks (box carrying, sitting, lying, and standing up) in both simulation and real-world settings, demonstrating consistently high success rates, strong generalization across diverse task goals, and natural motion patterns.
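To make the adversarial motion prior idea concrete, the sketch below shows how a discriminator score is commonly turned into a style reward and blended with a task reward. It assumes the least-squares formulation from the original AMP work; the summary above does not state PhysHSI's exact reward. The function names and the equal weights `w_task` and `w_style` are hypothetical and chosen only for illustration.

```python
def amp_style_reward(disc_score: float) -> float:
    """Style reward from a discriminator score d, using the common
    least-squares AMP form: max(0, 1 - 0.25 * (d - 1)^2).

    A score near +1 means the transition looks like reference motion;
    a score near -1 means it looks policy-generated.
    """
    return max(0.0, 1.0 - 0.25 * (disc_score - 1.0) ** 2)


def total_reward(task_reward: float, disc_score: float,
                 w_task: float = 0.5, w_style: float = 0.5) -> float:
    """Blend the task reward with the style reward.

    The weights are illustrative assumptions, not values from PhysHSI.
    """
    return w_task * task_reward + w_style * amp_style_reward(disc_score)
```

Under this form, a transition the discriminator scores as fully reference-like (d = 1) earns the maximum style reward, and one scored at d = -1 earns none, so the policy is pushed toward motions that resemble the MoCap data while still pursuing the task.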
