HAIC: Humanoid Agile Object Interaction Control via Dynamics-Aware World Model
Dongting Li, Xingyu Chen, Qianyang Wu, Bo Chen, Sikai Wu, Hanyu Wu, Guoyao Zhang, Liang Li, Mingliang Zhou, Diyun Xiang, Jianzhu Ma, Qiang Zhang, Renjing Xu
TL;DR
HAIC targets robust humanoid interaction with underactuated objects under visual occlusion. Its dynamics-aware world model predicts high-order object states from proprioception and explicitly projects them onto a static geometric prior to form a dynamic occupancy map. The framework comprises an Object Adapter, Explicit Geometric Projection, and a Privilege Adapter, trained with a two-stage asymmetric distillation regime that uses EMA stabilization to bridge the sim-to-real gap. Key contributions are the Dynamics-Aware World Model, Asymmetric Adaptive Distillation, and a multimodal contact reward. Real-world experiments report state-of-the-art performance on skateboarding and cart manipulation, as well as on multi-terrain carrying and long-horizon tasks. The results show proactive inertia compensation and improved stability, with zero-shot generalization across object size, terrain orientation, and external perturbations, which matters for deploying agile, sensor-limited humanoids in unstructured environments. Evaluation is quantitative, using metrics such as $E_{mpbpe}$, $E_{mpboe}$, and $E_{mpjpe}$.
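
To make the "EMA stabilization" idea concrete, here is a minimal sketch of a generic exponential-moving-average parameter update. The summary does not say which quantities HAIC smooths or what decay it uses, so the function name `ema_update`, the dictionary-of-arrays parameter layout, and the default `decay` are all assumptions made for illustration only.

```python
import numpy as np

def ema_update(ema_params, new_params, decay=0.99):
    """Return EMA-smoothed parameters (hypothetical helper, not from the paper).

    ema_params, new_params: dicts mapping parameter names to NumPy arrays.
    decay: weight kept on the previous EMA value (assumed, not from the source).
    """
    return {
        name: decay * ema_params[name] + (1.0 - decay) * new_params[name]
        for name in ema_params
    }

# Usage: smooth a toy parameter set toward a freshly trained one.
ema = {"w": np.array([1.0, 1.0])}
fresh = {"w": np.array([0.0, 2.0])}
ema = ema_update(ema, fresh, decay=0.9)
```

A decay close to 1 makes the smoothed copy change slowly, which is the usual reason EMA is used to damp a moving training target.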
Abstract
Humanoid robots show promise for complex whole-body tasks in unstructured environments. Although Human-Object Interaction (HOI) has advanced, most methods focus on fully actuated objects rigidly coupled to the robot, ignoring underactuated objects with independent dynamics and non-holonomic constraints. These introduce control challenges from coupling forces and occlusions. We present HAIC, a unified framework for robust interaction across diverse object dynamics without external state estimation. Our key contribution is a dynamics predictor that estimates high-order object states (velocity, acceleration) solely from proprioceptive history. These predictions are projected onto static geometric priors to form a spatially grounded dynamic occupancy map, enabling the policy to infer collision boundaries and contact affordances in blind spots. We use asymmetric fine-tuning, where a world model continuously adapts to the student policy's exploration, ensuring robust state estimation under distribution shifts. Experiments on a humanoid robot show HAIC achieves high success rates in agile tasks (skateboarding, cart pushing/pulling under various loads) by proactively compensating for inertial perturbations, and also masters multi-object long-horizon tasks like carrying a box across varied terrain by predicting the dynamics of multiple objects.
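
The projection step described above (predicted object states applied to a static geometric prior to yield a dynamic occupancy map) can be illustrated with a small 2D sketch. This is not the authors' implementation: the function name `dynamic_occupancy`, the planar point-set prior, the robot-centric grid, the two-channel layout, and the constant-velocity look-ahead `horizon` are all assumptions chosen to keep the example short.

```python
import numpy as np

def dynamic_occupancy(prior_points, position, yaw, velocity,
                      horizon=0.2, grid_size=16, cell=0.1):
    """Rasterize a static object shape at its predicted pose (hypothetical sketch).

    prior_points: (N, 2) points in the object frame (the static geometric prior).
    position, velocity: (2,) predicted object position and linear velocity,
        expressed in the robot frame.
    yaw: predicted object heading in radians.
    Returns a (2, grid_size, grid_size) array: channel 0 is occupancy now,
    channel 1 is occupancy extrapolated `horizon` seconds ahead.
    """
    c, s = np.cos(yaw), np.sin(yaw)
    rot = np.array([[c, -s], [s, c]])
    now = prior_points @ rot.T + position   # object frame -> robot frame
    future = now + horizon * velocity       # constant-velocity look-ahead (assumed)

    grid = np.zeros((2, grid_size, grid_size), dtype=np.float32)
    half = grid_size * cell / 2.0           # grid is centred on the robot
    for channel, pts in enumerate((now, future)):
        idx = np.floor((pts + half) / cell).astype(int)
        inside = np.all((idx >= 0) & (idx < grid_size), axis=1)
        grid[channel, idx[inside, 0], idx[inside, 1]] = 1.0
    return grid

# Usage: a single-point "object" moving along +x at 1 m/s.
occ = dynamic_occupancy(np.array([[0.0, 0.0]]),
                        position=np.array([0.05, 0.05]),
                        yaw=0.0,
                        velocity=np.array([1.0, 0.0]))
```

Because the shape comes from the prior and only the pose and velocity are predicted, a map like this can be filled in even when the object is outside the camera's view, which is the property the abstract relies on for blind-spot reasoning.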
