PhysHSI: Towards a Real-World Generalizable and Natural Humanoid-Scene Interaction System

Huayi Wang, Wentao Zhang, Runyi Yu, Tao Huang, Junli Ren, Feiyu Jia, Zirui Wang, Xiaojie Niu, Xiao Chen, Jiahe Chen, Qifeng Chen, Jingbo Wang, Jiangmiao Pang

TL;DR

PhysHSI tackles real-world humanoid-scene interaction by unifying simulation-trained AMP-based policies with a coarse-to-fine perception system for robust object localization. The approach leverages retargeted MoCap data, an adversarial discriminator to guide motion realism, and a hybrid reference-state initialization to enable efficient exploration across long-horizon tasks. Real-world experiments show zero-shot transfer across carry, sit, lie, and stand tasks with strong generalization and natural motion, including outdoor deployment, supported by domain randomization and robust perception. The work provides a practical path toward generalizable, lifelike humanoid interactions and offers comprehensive evaluation protocols and insights for future real-world HSI research.

Abstract

Deploying humanoid robots to interact with real-world environments--such as carrying objects or sitting on chairs--requires generalizable, lifelike motions and robust scene perception. Although prior approaches have advanced each capability individually, combining them in a unified system is still an ongoing challenge. In this work, we present a physical-world humanoid-scene interaction system, PhysHSI, that enables humanoids to autonomously perform diverse interaction tasks while maintaining natural and lifelike behaviors. PhysHSI comprises a simulation training pipeline and a real-world deployment system. In simulation, we adopt adversarial motion prior-based policy learning to imitate natural humanoid-scene interaction data across diverse scenarios, achieving both generalization and lifelike behaviors. For real-world deployment, we introduce a coarse-to-fine object localization module that combines LiDAR and camera inputs to provide continuous and robust scene perception. We validate PhysHSI on four representative interactive tasks--box carrying, sitting, lying, and standing up--in both simulation and real-world settings, demonstrating consistently high success rates, strong generalization across diverse task goals, and natural motion patterns.
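As a rough illustration of the adversarial motion prior idea mentioned in the abstract, the sketch below follows the least-squares formulation from the original AMP work: the discriminator is pushed toward +1 on reference motion transitions and -1 on policy transitions, and its score is mapped to a bounded style reward. The function names, the plain-Python scalar interface, and the weights `w_task` and `w_style` are illustrative assumptions; PhysHSI's exact reward terms and weights are not given in this section, and the gradient penalty used in practice is omitted.

```python
from typing import Sequence


def amp_style_reward(d_score: float) -> float:
    """Map a discriminator score to a style reward in [0, 1].

    Uses the least-squares form from the original AMP formulation:
    r = max(0, 1 - 0.25 * (d - 1)^2). A score of +1 ("looks like
    reference motion") gives reward 1; a score of -1 gives reward 0.
    """
    return max(0.0, 1.0 - 0.25 * (d_score - 1.0) ** 2)


def discriminator_lsq_loss(d_ref: Sequence[float], d_policy: Sequence[float]) -> float:
    """Least-squares discriminator objective (gradient penalty omitted).

    d_ref:    discriminator scores on reference (MoCap-derived) transitions
    d_policy: discriminator scores on policy-generated transitions
    """
    ref_term = sum((d - 1.0) ** 2 for d in d_ref) / len(d_ref)
    policy_term = sum((d + 1.0) ** 2 for d in d_policy) / len(d_policy)
    return ref_term + policy_term


def total_reward(task_reward: float, d_score: float,
                 w_task: float = 0.5, w_style: float = 0.5) -> float:
    """Blend task and style rewards. The weights are hypothetical placeholders."""
    return w_task * task_reward + w_style * amp_style_reward(d_score)
```

With this shaping, the policy is rewarded both for completing the task and for producing transitions the discriminator cannot tell apart from the reference data, which is what drives the natural-looking motion.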

Paper Structure

This paper contains 40 sections, 22 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Overview of PhysHSI. (a) Dataset Preparation: Human motions from a MoCap dataset are retargeted to humanoid motions, and objects are annotated by identifying key contact frames. (b) AMP Policy Training: A discriminator distinguishes between policy-generated and reference motions to facilitate learning of natural behaviors and task completion. (c) Real-World Deployment: The coarse object position is manually specified using LiDAR visualization, and combined with odometry for coarse localization when the object is outside the camera's FOV. Once within view, AprilTag detection combined with odometry is used for fine-grained, automated localization.
  • Figure 2: Spatial Generalization. Root trajectories of the robot are shown for tasks (a) Carry Box and (b) Lie Down. Red trajectories indicate reference data, with others representing sampled policy motions.
  • Figure 3: Real-World Generalization. PhysHSI generalizes to diverse real-world scenes, (a) handling boxes of varying shapes, weights, and heights, and (b) sitting or (c) lying on chairs and beds of different heights, both indoors and outdoors.
  • Figure 4: Real-World Localization System Analysis. (a) Localization error versus robot–object distance, with coarse-to-fine transition statistics and distribution. (b) A representative object localization trajectory, highlighting three stages: (i) coarse localization, (ii) fine localization, and (iii) grasp.
  • Figure 5: 3D model of the D455 camera bracket.
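
The coarse-to-fine localization described in the Figure 1(c) caption can be sketched roughly as follows. This is a simplified planar (2D) illustration, not the authors' implementation: the class `CoarseToFineLocalizer`, the `Pose2D` type, and the assumption that the tag detection is already expressed in the robot base frame are all hypothetical. A real system would work with full 3D poses and camera extrinsics.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

Vec2 = Tuple[float, float]


@dataclass
class Pose2D:
    """Robot pose from odometry in the world frame (yaw in radians)."""
    x: float
    y: float
    yaw: float


def world_to_robot(p_world: Vec2, robot: Pose2D) -> Vec2:
    """Express a world-frame point in the robot frame."""
    dx, dy = p_world[0] - robot.x, p_world[1] - robot.y
    c, s = math.cos(robot.yaw), math.sin(robot.yaw)
    return (c * dx + s * dy, -s * dx + c * dy)


def robot_to_world(p_robot: Vec2, robot: Pose2D) -> Vec2:
    """Express a robot-frame point in the world frame."""
    c, s = math.cos(robot.yaw), math.sin(robot.yaw)
    return (robot.x + c * p_robot[0] - s * p_robot[1],
            robot.y + s * p_robot[0] + c * p_robot[1])


class CoarseToFineLocalizer:
    """Tracks an object's world position and reports it in the robot frame."""

    def __init__(self, coarse_object_world: Vec2):
        # Coarse stage: position specified manually (e.g. from a LiDAR view).
        self.object_world = coarse_object_world
        self.stage = "coarse"

    def update(self, odom: Pose2D, tag_in_robot: Optional[Vec2]) -> Vec2:
        """Return the object position in the robot frame for this step.

        tag_in_robot is None when the object is outside the camera's FOV;
        otherwise it is the detected position (assumed already in the
        robot base frame).
        """
        if tag_in_robot is not None:
            # Fine stage: refresh the world-frame estimate from the detection.
            self.object_world = robot_to_world(tag_in_robot, odom)
            self.stage = "fine"
        # In both stages, odometry carries the stored estimate forward.
        return world_to_robot(self.object_world, odom)


localizer = CoarseToFineLocalizer(coarse_object_world=(2.0, 0.0))
coarse = localizer.update(Pose2D(0.0, 0.0, 0.0), tag_in_robot=None)
fine = localizer.update(Pose2D(1.0, 0.0, 0.0), tag_in_robot=(1.2, 0.1))
after_turn = localizer.update(Pose2D(1.0, 0.0, math.pi / 2), tag_in_robot=None)
```

Storing the estimate in the world frame means the most recent detection keeps being used, propagated by odometry, whenever the tag leaves the field of view.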