Ego-Vision World Model for Humanoid Contact Planning

Hang Liu, Yuman Gao, Sangli Teng, Yufeng Chi, Yakun Sophia Shao, Zhongyu Li, Maani Ghaffari, Koushil Sreenath

TL;DR

The paper addresses how humanoid robots can exploit contact in unstructured environments. It couples a scalable ego-centric visual world model with a sampling-based MPC guided by a learned surrogate value function, all trained on demonstration-free offline data. The approach learns latent dynamics and predictive values in a compact latent space, plans with a value-guided MPC that uses a large candidate set and a short horizon, and runs in real time on a physical humanoid. Key contributions include a stable offline training pipeline with reconstruction, joint-embedding predictive, and Q-losses; a latent-state architecture separating dynamics and observation information; and empirical validation showing data efficiency, multi-task capability, and robust real-world performance across wall-support, ball-blocking, and arch-traversal tasks. The framework advances data-efficient, vision-based planning for contact-rich humanoid control, with practical implications for autonomous operation in cluttered, dynamic environments.

Abstract

Enabling humanoid robots to exploit physical contact, rather than simply avoid collisions, is crucial for autonomy in unstructured environments. Traditional optimization-based planners struggle with contact complexity, while on-policy reinforcement learning (RL) is sample-inefficient and has limited multi-task ability. We propose a framework combining a learned world model with sampling-based Model Predictive Control (MPC), trained on a demonstration-free offline dataset to predict future outcomes in a compressed latent space. To address sparse contact rewards and sensor noise, the MPC uses a learned surrogate value function for dense, robust planning. Our single, scalable model supports contact-aware tasks, including wall support after perturbation, blocking incoming objects, and traversing height-limited arches, with improved data efficiency and multi-task capability over on-policy RL. Deployed on a physical humanoid, our system achieves robust, real-time contact planning from proprioception and ego-centric depth images. Website: https://ego-vcp.github.io/

Paper Structure

This paper contains 18 sections, 16 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: An illustration of our framework in the "Support the Wall" task. When subjected to a sudden perturbation (left), the robot uses its learned world model to predict and plan a stabilizing action within its planning horizon (center). This allows it to successfully execute the plan and brace its hands against the wall to make contact and maintain balance (right).
  • Figure 2: World Model Training Pipeline. The pipeline begins with the offline data collection process shown in (a), where a dataset $\mathcal{D}$ of trajectories is generated by applying randomly sampled high-level actions (end-effector position $p_{ee}^{\top}$ and body height $h_{body}$) to a simulated humanoid equipped with a trained low-level policy. This dataset is then used to train the world model, as depicted in (b). At each timestep $t$, an observation $o_t$, consisting of a depth image and proprioception, is encoded into a stochastic latent state $z_t$, which is then decoded to produce a reconstruction $\hat{o}_t$. Concurrently, a recurrent network updates its latent state $h_t$ based on the previous state and action. From the combined latent state $(h_t, z_t)$, the model predicts (i) $\hat{z}_t$, a prior sample of the stochastic latent state; (ii) $\hat{d}_t$, the termination probability; and (iii) $\hat{Q}_t$, a surrogate action-value that guides the planner in evaluating different actions. All of these predictions are optimized against the ground-truth data from the offline trajectories, enabling the model to learn both the environment's dynamics and a robust value function for planning.
  • Figure 3: Value-Guided Sampling MPC. This figure illustrates how the trained world model is used for planning via value-guided sampling MPC. The process performs open-loop prediction to find the best action sequence starting from a single real observation. At inference time, it begins by encoding the current observation $o_t$ into its latent state $z_t$, after which the planner samples a batch of $M=1024$ candidate action sequences over a planning horizon of $N = 4$ steps. The world model predicts the future latent states ($h_{t+k}, \hat{z}_{t+k}$) by recursively applying its learned dynamics model. At each prediction step, the surrogate value ($\hat{Q}_{t+k}$) evaluates the sampled actions, while the termination signal, $\hat{d}_{t}$, predicts the probability of robot failure, such as falling; if this probability exceeds a threshold of 0.9, all subsequent value estimates, $\hat{Q}$, for that trajectory are set to zero. The planner evaluates the $M$ candidate trajectories, where the score for each trajectory is calculated by the objective function $\hat{J}_N$ in Eq. \ref{eq: final-obj}. This set of scored trajectories is then optimized using the Cross-Entropy Method (CEM) to find the optimal action sequence.
  • Figure 4: Real-world experiments validating the proposed framework. (a) A demonstration of sequential task execution and generalization, where the robot traverses an arch (i) and then blocks a previously unseen box (ii). (b) Supporting against the wall: when pushed towards the wall, the robot braces its hands against it to maintain balance. (c) Blocking both an in-distribution ball (with a size consistent with the training data) and an unseen box. (d) Squatting and traversing an arch.
  • Figure 5: Sample efficiency comparison. Our method uses an offline dataset collected from random actions, while PPO collects data from environments at every iteration. The x-axis represents the number of step transitions used, while the y-axis shows the reward for each task. A greater value on the x-axis indicates a larger amount of data used, and a higher value on the y-axis signifies better performance. While our method utilizes a dataset of at most 1M steps, we continued to train PPO for a greater number of steps to determine when it could achieve comparable performance.
  • ...and 2 more figures
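
The value-guided sampling MPC described in the Figure 3 caption can be illustrated with a minimal sketch. Only $M=1024$, $N=4$, the 0.9 termination threshold, and the use of CEM come from the caption. Everything else is an assumption made for illustration: the functions `toy_dynamics`, `toy_value`, and `toy_done` are hypothetical stand-ins for the learned networks, the latent and action dimensions are arbitrary, and the score is taken to be a discounted sum of masked values because the exact form of $\hat{J}_N$ is not given in this section. The sketch also assumes that "subsequent" means the steps strictly after the one at which termination is flagged.

```python
import numpy as np

M, N = 1024, 4                    # candidates and horizon, from the Figure 3 caption
DONE_THRESHOLD = 0.9              # termination threshold, from the Figure 3 caption
ACTION_DIM, LATENT_DIM = 4, 8     # assumed sizes, for illustration only

_rng = np.random.default_rng(0)
A = _rng.normal(scale=0.3, size=(LATENT_DIM, LATENT_DIM))  # hypothetical dynamics weights
B = _rng.normal(scale=0.3, size=(ACTION_DIM, LATENT_DIM))
GOAL = np.array([0.5, -0.2, 0.3, 0.1])                     # hypothetical preferred action


def toy_dynamics(s, a):
    """Stand-in for the learned latent dynamics: next latent from (latent, action)."""
    return np.tanh(s @ A + a @ B)


def toy_value(s, a):
    """Stand-in for the surrogate Q head; positive, peaks when a == GOAL."""
    return np.exp(-np.sum((a - GOAL) ** 2, axis=-1))


def toy_done(s):
    """Stand-in for the termination head; returns a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(10.0 * s[:, 0] - 5.0)))


def score_trajectories(q, d, gamma=0.99):
    """Zero every Q after the first step whose termination probability exceeds
    the threshold, then sum with discounting (assumed form of the objective)."""
    flagged = np.cumsum(d > DONE_THRESHOLD, axis=1) > 0
    terminated_before = np.concatenate(
        [np.zeros((q.shape[0], 1), dtype=bool), flagged[:, :-1]], axis=1
    )
    discounts = gamma ** np.arange(q.shape[1])
    return np.sum(discounts * q * ~terminated_before, axis=1)


def plan(s0, iters=5, n_elite=64, seed=0):
    """CEM over open-loop action sequences, starting from one encoded latent s0."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((N, ACTION_DIM))
    std = np.ones((N, ACTION_DIM))
    for _ in range(iters):
        noise = rng.standard_normal((M, N, ACTION_DIM))
        actions = np.clip(mean + std * noise, -1.0, 1.0)
        s = np.repeat(s0[None, :], M, axis=0)
        q = np.empty((M, N))
        d = np.empty((M, N))
        for k in range(N):
            q[:, k] = toy_value(s, actions[:, k])   # value of the action taken at step k
            s = toy_dynamics(s, actions[:, k])      # recursive latent prediction
            d[:, k] = toy_done(s)                   # failure probability after the step
        scores = score_trajectories(q, d)
        elite = actions[np.argsort(scores)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean  # refined action sequence; an MPC loop would execute only mean[0]


best_sequence = plan(np.zeros(LATENT_DIM))
```

In a receding-horizon loop, only the first action of `best_sequence` would be sent to the low-level policy before replanning from the next observation. The number of CEM iterations and the elite count are likewise illustrative choices, not values reported here.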