Ego-Vision World Model for Humanoid Contact Planning
Hang Liu, Yuman Gao, Sangli Teng, Yufeng Chi, Yakun Sophia Shao, Zhongyu Li, Maani Ghaffari, Koushil Sreenath
TL;DR
The paper tackles the problem of enabling humanoid robots to exploit contact in unstructured environments. It couples a scalable ego-centric visual world model, trained entirely on demonstration-free offline data, with a sampling-based MPC guided by a learned surrogate value function. The approach learns latent dynamics and predictive values in a compact latent space, plans with a value-guided MPC that evaluates a large candidate set over a short horizon, and runs in real time on a physical humanoid. Key contributions include a stable offline training pipeline that combines reconstruction, joint-embedding predictive, and Q-losses; a latent-state architecture that separates dynamics information from observation information; and empirical validation showing data efficiency, multi-task capability, and robust real-world performance across wall-support, ball-blocking, and arch-traversal tasks. The framework advances data-efficient, vision-based planning for contact-rich humanoid control, with practical implications for autonomous operation in cluttered, dynamic environments.
Abstract
Enabling humanoid robots to exploit physical contact, rather than simply avoid collisions, is crucial for autonomy in unstructured environments. Traditional optimization-based planners struggle with contact complexity, while on-policy reinforcement learning (RL) is sample-inefficient and has limited multi-task ability. We propose a framework combining a learned world model with sampling-based Model Predictive Control (MPC), trained on a demonstration-free offline dataset to predict future outcomes in a compressed latent space. To address sparse contact rewards and sensor noise, the MPC uses a learned surrogate value function for dense, robust planning. Our single, scalable model supports contact-aware tasks, including wall support after perturbation, blocking incoming objects, and traversing height-limited arches, with improved data efficiency and multi-task capability over on-policy RL. Deployed on a physical humanoid, our system achieves robust, real-time contact planning from proprioception and ego-centric depth images. Website: https://ego-vcp.github.io/
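As a rough illustration of the value-guided sampling-based MPC described above, the following minimal sketch samples many candidate action sequences, rolls each one forward through a latent dynamics function over a short horizon, scores it with a surrogate Q-function, and executes the first action of the best candidate. It is not the authors' implementation: the function names (`plan_action`, `toy_dynamics`, `toy_q_value`), the uniform sampling distribution, the discounted-sum scoring rule, and all parameter values are assumptions made for illustration, and the toy dynamics and value stand in for the learned networks.

```python
import numpy as np

def plan_action(z0, dynamics, q_value, action_dim,
                num_candidates=1024, horizon=3, gamma=0.99, rng=None):
    """Pick the first action of the best-scoring sampled action sequence.

    z0:       current latent state, shape (latent_dim,)
    dynamics: hypothetical batched latent model, (z, a) -> next z
    q_value:  hypothetical batched surrogate value, (z, a) -> score per row
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample candidate action sequences, shape (N, H, action_dim).
    # Uniform sampling in [-1, 1] is an assumption of this sketch.
    actions = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon, action_dim))
    # Every candidate starts from the same latent state.
    z = np.repeat(z0[None, :], num_candidates, axis=0)
    scores = np.zeros(num_candidates)
    for t in range(horizon):
        a_t = actions[:, t]
        # Accumulate the discounted surrogate value along the latent rollout.
        scores += (gamma ** t) * q_value(z, a_t)
        z = dynamics(z, a_t)
    best = int(np.argmax(scores))
    return actions[best, 0], float(scores[best])

# Toy stand-ins for the learned networks (assumptions, not the paper's models).
GOAL = np.array([0.5, -0.5])

def toy_dynamics(z, a):
    # The latent state simply integrates the action.
    return z + a

def toy_q_value(z, a):
    # Higher value the closer the next latent state is to GOAL.
    return -np.sum((z + a - GOAL) ** 2, axis=-1)

if __name__ == "__main__":
    action, score = plan_action(np.zeros(2), toy_dynamics, toy_q_value,
                                action_dim=2, horizon=1,
                                rng=np.random.default_rng(0))
    print("chosen action:", action, "score:", score)
```

In a receding-horizon loop, only the returned first action would be executed before re-planning from the next latent state, which is why a short horizon with a large candidate set can remain cheap enough to evaluate at every control step.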
