GeoLoco: Leveraging 3D Geometric Priors from Visual Foundation Model for Robust RGB-Only Humanoid Locomotion

Yufei Liu; Xieyuanli Chen; Hainan Pan; Chenghao Shi; Yanjie Chen; Kaihong Huang; Zhiwen Zeng; Huimin Lu

TL;DR

GeoLoco is a purely RGB-driven locomotion framework that conceptualizes monocular images as high-dimensional 3D latent representations by harnessing the powerful geometric priors of a frozen, scale-aware Visual Foundation Model (VFM).

Abstract

The prevailing paradigm of perceptive humanoid locomotion relies heavily on active depth sensors. However, this depth-centric approach fundamentally discards the rich semantic and dense appearance cues of the visual world, severing low-level control from the high-level reasoning essential for general embodied intelligence. While monocular RGB offers a ubiquitous, information-dense alternative, end-to-end reinforcement learning from raw 2D pixels suffers from extreme sample inefficiency and catastrophic sim-to-real collapse due to the inherent loss of geometric scale. To break this deadlock, we propose GeoLoco, a purely RGB-driven locomotion framework that conceptualizes monocular images as high-dimensional 3D latent representations by harnessing the powerful geometric priors of a frozen, scale-aware Visual Foundation Model (VFM). Rather than naive feature concatenation, we design a proprioceptive-query multi-head cross-attention mechanism that dynamically attends to task-critical topological features conditioned on the robot's real-time gait phase. Crucially, to prevent the policy from overfitting to superficial textures, we introduce a dual-head auxiliary learning scheme. This explicit regularization forces the high-dimensional latent space to strictly align with the physical terrain geometry, ensuring robust zero-shot sim-to-real transfer. Trained exclusively in simulation, GeoLoco achieves robust zero-shot transfer to the Unitree G1 humanoid and successfully negotiates challenging terrains.
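
The proprioceptive-query cross-attention described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the dimensions, the random weights, and the names `proprio_query_attention` and `init_params` are hypothetical, and the proprioceptive vector is simply assumed to already contain whatever gait-phase encoding the policy uses. The sketch only shows the general pattern in which a single query derived from the robot state attends over visual tokens from a frozen encoder.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, proprio_dim, d_model):
    """Random projection matrices (hypothetical; a real policy would learn these)."""
    scale = 1.0 / np.sqrt(d_model)
    return {
        "w_q": rng.normal(size=(proprio_dim, d_model)) * scale,  # query from robot state
        "w_k": rng.normal(size=(d_model, d_model)) * scale,      # keys from visual tokens
        "w_v": rng.normal(size=(d_model, d_model)) * scale,      # values from visual tokens
        "w_o": rng.normal(size=(d_model, d_model)) * scale,      # output projection
    }


def proprio_query_attention(proprio, tokens, params, num_heads):
    """Single-query multi-head cross-attention.

    proprio: (P,) proprioceptive state (assumed to include a gait-phase encoding)
    tokens:  (N, D) visual tokens, e.g. patch features from a frozen encoder
    returns: fused feature of shape (D,) and attention weights of shape (H, N)
    """
    n_tokens, d_model = tokens.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    q = proprio @ params["w_q"]  # (D,)
    k = tokens @ params["w_k"]   # (N, D)
    v = tokens @ params["w_v"]   # (N, D)

    # Split the model dimension into heads.
    q = q.reshape(num_heads, d_head)                                # (H, d)
    k = k.reshape(n_tokens, num_heads, d_head).transpose(1, 0, 2)   # (H, N, d)
    v = v.reshape(n_tokens, num_heads, d_head).transpose(1, 0, 2)   # (H, N, d)

    # Scaled dot-product scores: one row of weights over tokens per head.
    scores = np.einsum("hd,hnd->hn", q, k) / np.sqrt(d_head)        # (H, N)
    weights = softmax(scores, axis=-1)

    # Weighted sum of values per head, then merge heads and project.
    heads = np.einsum("hn,hnd->hd", weights, v)                     # (H, d)
    fused = heads.reshape(d_model) @ params["w_o"]                  # (D,)
    return fused, weights


rng = np.random.default_rng(0)
params = init_params(rng, proprio_dim=48, d_model=64)
proprio = rng.normal(size=48)        # hypothetical 48-dim robot state
tokens = rng.normal(size=(196, 64))  # hypothetical 14x14 grid of patch tokens
fused, weights = proprio_query_attention(proprio, tokens, params, num_heads=4)
```

Because the query comes from the robot state rather than from the image, the attention weights change as the state changes even when the visual tokens stay fixed, which is the property the abstract relies on when it describes attention conditioned on the gait phase.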

Paper Structure (26 sections, 7 equations, 6 figures, 4 tables)

This paper contains 26 sections, 7 equations, 6 figures, 4 tables.

Introduction
Related Work
Bottlenecks in Perceptive Locomotion
Visual Representations: From 2D Priors to 3D Priors
Sim-to-Real Transfer for Visual Policy
Method
Problem Formulation
Geometry-Prior Visual Representation from RGB
Multi-Scale Tokenized Feature Extraction
Channel-Grouped Spatial Projection
Temporal Alignment and Asynchronous Inference
Multi-Head Cross-Attention Fusion
Spatio-Temporal Token Formulation
Proprioceptive-Query Mechanism
Physical Interpretation of the Attention Score
...and 11 more sections

Figures (6)

Figure 1: Overview of GeoLoco. Our framework enables robust humanoid locomotion on diverse terrains—including stairs, ramps, and uneven blocks—using only a monocular RGB camera. By conceptualizing 2D pixels as high-dimensional 3D geometric representations, GeoLoco eliminates the dependency on active depth sensors (LiDAR/Depth) while maintaining expert-level traversal performance.
Figure 2: The GeoLoco Architecture. A frozen, scale-aware Visual Geometry Encoder extracts multi-scale 3D priors from asynchronous RGB streams (10 Hz). These features are fused with high-frequency proprioceptive signals (50 Hz) via a Multi-Head Cross-Attention mechanism. To bridge the sim-to-real gap, a dual-head auxiliary decoder (middle right) regularizes the latent space by reconstructing local terrain topography and predicting system dynamics during training.
Figure 3: Multi-Scale geometric activations from the frozen VFM. We visualize patch token activations from intermediate transformer layers (4/8/12).
Figure 4: Visualization of the attention heatmap produced by the proprioceptive-query cross-attention module for a representative stair scene.
Figure 5: Training environments featuring diverse geometries and randomized textures. This prevents overfitting to visual appearances, ensuring the policy learns robust 3D priors for sim-to-real transfer.
...and 1 more figure
