VIGOR: Visual Goal-In-Context Inference for Unified Humanoid Fall Safety

Osher Azulay; Zhengjie Xu; Andrew Scheffer; Stella X. Yu

TL;DR

This work trains a privileged teacher using sparse human demonstrations on flat terrain and simulated complex terrains, then distills it into a deployable student that relies only on egocentric depth and proprioception. The student demonstrates robust, zero-shot fall safety across diverse non-flat environments without real-world fine-tuning.

Abstract

Reliable fall recovery is critical for humanoids operating in cluttered environments. Unlike quadrupeds or wheeled robots, humanoids experience high-energy impacts, complex whole-body contact, and large viewpoint changes during a fall, making recovery essential for continued operation. Existing methods fragment fall safety into separate problems such as fall avoidance, impact mitigation, and stand-up recovery, or rely on end-to-end policies trained without vision through reinforcement learning or imitation learning, often on flat terrain. At a deeper level, fall safety is treated as monolithic data complexity, coupling pose, dynamics, and terrain and requiring exhaustive coverage, limiting scalability and generalization. We present a unified fall safety approach that spans all phases of fall recovery. It builds on two insights: 1) Natural human fall and recovery poses are highly constrained and transferable from flat to complex terrain through alignment, and 2) Fast whole-body reactions require integrated perceptual-motor representations. We train a privileged teacher using sparse human demonstrations on flat terrain and simulated complex terrains, and distill it into a deployable student that relies only on egocentric depth and proprioception. The student learns how to react by matching the teacher's goal-in-context latent representation, which combines the next target pose with the local terrain, rather than separately encoding what it must perceive and how it must act. Results in simulation and on a real Unitree G1 humanoid demonstrate robust, zero-shot fall safety across diverse non-flat environments without real-world fine-tuning. The project page is available at https://vigor2026.github.io/
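The latent-matching idea in the abstract can be illustrated with a toy sketch. It assumes, purely for illustration, that the teacher's goal-in-context latent is a plain concatenation of the next target pose and a few local terrain height samples, and that the student is trained with a mean-squared-error loss against it. The function names, the concatenation, and the loss choice are all hypothetical stand-ins; the paper's actual encoders and objective are not specified in this section.

```python
from typing import List, Sequence


def goal_in_context_latent(target_pose: Sequence[float],
                           terrain_heights: Sequence[float]) -> List[float]:
    """Toy stand-in for a teacher encoder (assumption, not the paper's model).

    Combines the next target pose with local terrain samples by simple
    concatenation; a real teacher would use a learned encoder.
    """
    return list(target_pose) + list(terrain_heights)


def latent_matching_loss(student_latent: Sequence[float],
                         teacher_latent: Sequence[float]) -> float:
    """Mean squared error between student and teacher latents (assumed loss)."""
    if len(student_latent) != len(teacher_latent):
        raise ValueError("student and teacher latents must have the same size")
    squared_errors = [(s - t) ** 2 for s, t in zip(student_latent, teacher_latent)]
    return sum(squared_errors) / len(teacher_latent)


# Hypothetical values: a 2-D target pose and two terrain height samples.
teacher = goal_in_context_latent([0.1, 0.2], [0.0, 0.05])
# In practice the student latent would be predicted from depth and proprioception.
student_guess = [0.1, 0.2, 0.0, 0.05]
loss = latent_matching_loss(student_guess, teacher)
```

In this sketch the student is supervised on a single joint target rather than on separate perception and action targets, which mirrors the abstract's description at a very coarse level.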

Paper Structure (41 sections, 8 equations, 11 figures, 10 tables)

Introduction
Related Work
Humanoid Fall Mitigation and Standing-Up Control
Visual Whole-Body Control
Motion Priors and Style-Constrained RL
Method
Motion Collection and Sparse Keyframe Extraction
Privileged Goal-in-Context Teacher
Reward Design
Egocentric Student Policy
Domain Randomization
Experiments
Implementation Details
Simulated Experiments
Metrics
...and 26 more sections

Figures (11)

Figure 2: Factorized data generation yields sample-efficient imitation and scalable adaptation for humanoid fall safety learning. Rather than treating pose, time, and terrain as a single monolithic data space requiring exhaustive coverage (left), we generate the same space by factorizing it into a small set of human pose trajectories from real-world demonstrations on flat terrain (middle) and independently varying terrain geometry in simulation (right), which can be arbitrarily complex.
Figure 3: Overview of VIGOR. 1) Motion retargeting: human fall-recovery demonstrations are kinematically retargeted to the robot. 2) Terrain alignment: reference poses are used directly on flat terrain and coarsely projected onto uneven terrain to provide sparse tracking targets. 3) Goal-in-context teacher policy learning: a privileged teacher policy is trained with RL to acquire a goal-in-context representation that encodes the immediate recovery target pose together with local terrain information. 4) Visual goal-in-context student distillation: a student policy distills the teacher’s terrain-aware recovery behavior from egocentric depth and short-term proprioceptive history for deployment.
Figure 4: Terrains used for training. From top to bottom: rough, waves, slope, inverted slope, stairs, and inverted stairs. The figure shows three representative difficulty levels per terrain for visualization.
Figure 5: Simulation performance, grouped by terrain and motion type. Top: success rate by terrain family. Bottom: success rate by initial fall direction, aggregated over terrains. The semi-transparent segment indicates unsafe successes. Results averaged over 300 trials per condition.
Figure 6: Recovery scenario examples. Each row shows a different initial condition and terrain, visualized with four key frames from left to right.
...and 6 more figures
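
The terrain alignment step in the Figure 3 caption (reference poses used directly on flat terrain and coarsely projected onto uneven terrain) might be sketched as follows. This is a minimal illustration under a strong assumption: each keypoint of a flat-terrain reference pose is lifted by the terrain height directly beneath it. The names `project_keyframe`, `flat`, and `slope`, and the per-keypoint height offset itself, are hypothetical; the paper's actual projection is not described in this section.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float, float]            # (x, y, z) in world coordinates
HeightField = Callable[[float, float], float]  # terrain height at (x, y)


def project_keyframe(keypoints: List[Point], terrain_height: HeightField) -> List[Point]:
    """Coarsely project a flat-terrain reference pose onto a terrain.

    Assumption: each keypoint keeps its (x, y) and its z is raised by the
    terrain height under it. On flat terrain (height 0) the pose is unchanged.
    """
    return [(x, y, z + terrain_height(x, y)) for x, y, z in keypoints]


def flat(x: float, y: float) -> float:
    """Flat ground at height zero."""
    return 0.0


def slope(x: float, y: float) -> float:
    """Hypothetical slope terrain rising 0.5 m per metre along x."""
    return 0.5 * x


# Two illustrative keypoints of a reference pose recorded on flat ground.
reference = [(0.0, 0.0, 0.1), (2.0, 0.0, 0.3)]
on_flat = project_keyframe(reference, flat)    # identical to the reference
on_slope = project_keyframe(reference, slope)  # second keypoint lifted by 1.0 m
```

Because the projected poses serve only as sparse tracking targets, a coarse offset like this would not need to be physically exact in such a scheme.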
