Learning Humanoid Navigation from Human Data

Weizhuo Wang, Yanjie Ze, C. Karen Liu, Monroe Kennedy

Abstract

We present EgoNav, a system that enables a humanoid robot to traverse diverse, unseen environments by learning entirely from 5 hours of human walking data, with no robot data or finetuning. A diffusion model predicts distributions of plausible future trajectories conditioned on the past trajectory, a 360° visual memory fusing color, depth, and semantics, and video features from a frozen DINOv3 backbone that capture appearance cues invisible to depth sensors. A hybrid sampling scheme achieves real-time inference in 10 denoising steps, and a receding-horizon controller selects paths from the predicted distribution. We validate EgoNav through offline evaluations, where it outperforms baselines in collision avoidance and multi-modal coverage, and through zero-shot deployment on a Unitree G1 humanoid across unseen indoor and outdoor environments. Behaviors such as waiting for doors to open, navigating around crowds, and avoiding glass walls emerge naturally from the learned prior. We will release the dataset and trained models. Our website: https://egonav.weizhuowang.com
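
As a rough illustration of the prediction-and-control loop described above, the sketch below samples a batch of candidate trajectories with a short 10-step denoising schedule and picks one to track before replanning. The ToyDenoiser, the endpoint-to-goal scoring heuristic, and all shapes are hypothetical placeholders, not the paper's model, hybrid sampler, or selection rule.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyDenoiser:
    """Stand-in for the learned conditional diffusion model (hypothetical:
    the real model conditions on the past trajectory, visual memory, and
    DINOv3 features). Each step pulls noisy paths toward a straight walk."""
    def denoise_step(self, x, t, cond):
        target = np.linspace(0.0, 4.0, x.shape[1])[None, :, None] * np.array([1.0, 0.0])
        return x + 0.3 * (target - x)

def sample_trajectories(model, cond, n_samples=32, horizon=16, steps=10):
    """Draw candidate futures with a short 10-step denoising schedule,
    mirroring the paper's real-time sampling budget."""
    x = rng.standard_normal((n_samples, horizon, 2))   # start from pure noise
    for t in reversed(range(steps)):
        x = model.denoise_step(x, t, cond)
    return x

def select_path(trajectories, goal_xy):
    """Receding-horizon selection: score each sampled future by endpoint
    distance to the goal (an illustrative heuristic, not the paper's)."""
    err = np.linalg.norm(trajectories[:, -1] - goal_xy, axis=-1)
    return trajectories[int(np.argmin(err))]

paths = sample_trajectories(ToyDenoiser(), cond=None)
best = select_path(paths, goal_xy=np.array([4.0, 0.0]))
print("first waypoint to track:", best[0])   # execute, then replan
```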

Figures (8)

  • Figure 1: EgoNav learns a navigation prior from human walking data: given past trajectory (purple) and a 360° visual context of the scene, a diffusion model generates a distribution of plausible future paths (red ribbons), with ribbon width indicating likelihood. The learned prior transfers directly to a Unitree G1 humanoid with zero robot data.
  • Figure 2: Overview of the proposed method: A rolling buffer of 32 segmented RGB frames and cleaned depth frames is combined into a single visual memory (VM). The VM is encoded into a 64-dimensional embedding, fused with DINOv3 video features, and concatenated with the 6D pose as input to the diffusion model, which then denoises the future trajectories. All inputs and outputs of the prediction module are in the egocentric frame. (A toy sketch of this input assembly follows the figure list.)
  • Figure 3: Dataset: The dataset covers a mix of weather, road, lighting, and traffic conditions.
  • Figure 4: Comparing a depth frame with the visual memory: A raw depth frame has only ~90° of FOV and misses important scene information. The depth frame sees only the open space ahead and does not capture the stairs, the right-turn path, or the wall to the left. Black regions are areas not yet observed.
  • Figure 5: Channels in the Visual Memory: The visual memory integrates past frames into a single panorama, consisting of depth, color, and an intensity-encoded 8-class semantic channel. Four of the eight channels are shown in the figure. (A toy fusion sketch also follows the figure list.)
  • ...and 3 more figures
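
The channel layout in Figure 5 can be made concrete with a small sketch. The panorama resolution and the exact channel ordering are assumptions, and only a 1+3+1 subset of the eight VM channels is stacked here; the intensity encoding of the 8-class semantic map follows the caption.

```python
import numpy as np

# Minimal sketch of fusing buffered observations into one panoramic
# visual memory (Figure 5). Resolution and channel ordering are
# assumptions; the real VM has eight channels, of which only the
# depth (1), color (3), and semantic (1) planes are sketched here.
H, W = 64, 256                                 # assumed 360° panorama size

def encode_semantics(sem_ids: np.ndarray) -> np.ndarray:
    """Intensity-encode an 8-class semantic map into a single plane."""
    return sem_ids.astype(np.float32) / 7.0    # classes 0..7 -> [0, 1]

def fuse_visual_memory(depth, rgb, sem_ids):
    """Stack depth, color, and encoded semantics into one tensor."""
    return np.concatenate(
        [depth[..., None], rgb, encode_semantics(sem_ids)[..., None]],
        axis=-1,
    )                                          # (H, W, 5) in this sketch

vm = fuse_visual_memory(
    np.random.rand(H, W),                      # cleaned depth panorama
    np.random.rand(H, W, 3),                   # color panorama
    np.random.randint(0, 8, size=(H, W)),      # per-pixel class ids
)
print(vm.shape)                                # (64, 256, 5)
```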
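
The input assembly in Figure 2 (a 64-dimensional VM embedding, DINOv3 video features, and the 6D pose, concatenated as conditioning for the diffusion model) might then look like the sketch below. The linear "encoder", the pooling, and the 384-d DINOv3 feature width are placeholders, not the paper's implementation; only the 64-d embedding, the DINOv3 conditioning, and the 6D pose come from the caption.

```python
import numpy as np

rng = np.random.default_rng(0)

vm = rng.random((64, 256, 5))            # visual memory from the sketch above
proj = rng.standard_normal((64, 5))      # placeholder for the VM encoder
vm_emb = proj @ vm.mean(axis=(0, 1))     # global-pool channels -> (64,)

dino_feats = rng.standard_normal(384)    # assumed DINOv3 feature width
pose6d = np.zeros(6)                     # past egocentric 6D pose

# Conditioning vector fed to the diffusion model (see the sampling
# sketch after the abstract for the 10-step denoising loop).
cond = np.concatenate([vm_emb, dino_feats, pose6d])
print(cond.shape)                        # (454,)
```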