E-SDS: Environment-aware See it, Do it, Sorted - Automated Environment-Aware Reinforcement Learning for Humanoid Locomotion

Enis Yalcin, Joshua O'Hara, Maria Stamatopoulou, Chengxu Zhou, Dimitrios Kanoulas

TL;DR

The paper tackles the bottleneck of manual reward engineering in reinforcement learning for humanoid locomotion by introducing E-SDS, a framework that conditions vision-language model–generated rewards on real-time terrain statistics from exteroceptive sensors. It integrates Grid-Frame Prompting and SUS prompting within an environment-aware reward-generation agent and an iterative training/refinement loop, enabling robust perceptive policies. In simulation on a Unitree G1 across simple, gap, obstacle, and stairs terrains, E-SDS outperforms both manually designed rewards and a perception-blind automated baseline: it is the only method that achieves stair descent, and it reduces velocity-tracking error by 51.9–82.6%, with around 99 minutes of terrain-specific training per case. The work highlights the value of environment-aware reward synthesis for scalable, autonomous skill acquisition, while acknowledging sim-to-real transfer and per-terrain specialization as areas for future work.

Abstract

Vision-language models (VLMs) show promise in automating reward design for humanoid locomotion, which could eliminate the need for tedious manual engineering. However, current VLM-based methods are essentially "blind", as they lack the environmental perception required to navigate complex terrain. We present E-SDS (Environment-aware See it, Do it, Sorted), a framework that closes this perception gap. E-SDS integrates VLMs with real-time terrain sensor analysis to automatically generate reward functions that facilitate training of robust perceptive locomotion policies, grounded by example videos. Evaluated on a Unitree G1 humanoid across four distinct terrains (simple, gaps, obstacles, stairs), E-SDS uniquely enabled successful stair descent, while policies trained with manually designed rewards or a non-perceptive automated baseline were unable to complete the task. In all terrains, E-SDS also reduced velocity tracking error by 51.9–82.6%. Our framework reduces the human effort of reward design from days to less than two hours while simultaneously producing more robust and capable locomotion policies.
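
As a rough illustration of what "terrain sensor analysis" feeding a reward-generating VLM might look like, the sketch below summarises a height scan into a few statistics and renders them as a line of prompt text. It is not the paper's implementation: the function names `terrain_stats` and `stats_to_prompt`, the chosen statistics, the rectangular-grid input, and the `gap_threshold` value are all assumptions made for this example.

```python
import statistics


def terrain_stats(height_scan, gap_threshold=-0.3):
    """Summarise a rectangular 2D height scan (metres, relative to the ground plane).

    Hypothetical helper: the statistics and the gap threshold are illustrative
    assumptions, not values taken from the paper.
    """
    cells = [h for row in height_scan for h in row]
    # Largest height jump between horizontally or vertically adjacent cells.
    max_step = 0.0
    for r, row in enumerate(height_scan):
        for c, h in enumerate(row):
            if c + 1 < len(row):
                max_step = max(max_step, abs(row[c + 1] - h))
            if r + 1 < len(height_scan):
                max_step = max(max_step, abs(height_scan[r + 1][c] - h))
    return {
        "mean_height": statistics.fmean(cells),
        "height_std": statistics.pstdev(cells),
        "max_step": max_step,
        # Fraction of cells lying below the (assumed) gap threshold.
        "gap_fraction": sum(h < gap_threshold for h in cells) / len(cells),
    }


def stats_to_prompt(stats):
    """Render the statistics as one line of text that could be added to a VLM prompt."""
    return "Terrain statistics: " + ", ".join(
        f"{name}={value:.3f}" for name, value in stats.items()
    )
```

For example, `stats_to_prompt(terrain_stats(scan))` could be appended to the reward-generation prompt so that the generated reward reflects how rough or gapped the terrain is, which is the kind of conditioning the abstract describes.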

Paper Structure

This paper contains 13 sections, 3 equations, 3 figures, 4 tables, 2 algorithms.

Figures (3)

  • Figure 1: E-SDS pipeline showing the automated reward generation and refinement.
  • Figure 2: Velocity tracking error for E-SDS (red), Foundation (green), and Baseline (purple).
  • Figure 3: Evaluation tasks in Isaac Lab. Simple (top left), gap (top right), obstacle (bottom left), stairs (bottom right).
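
The velocity tracking error compared in Figure 2, and the percentage reductions quoted in the abstract, can be made concrete with a small sketch. The exact error definition used in the paper is not given in this section, so the version below assumes one common form: the mean Euclidean distance between commanded and measured planar velocities. Both function names are hypothetical.

```python
import math


def velocity_tracking_error(commanded, measured):
    """Mean Euclidean distance between commanded and measured (vx, vy) pairs.

    Assumed definition for illustration; the paper may use a different norm
    or averaging scheme.
    """
    if not commanded or len(commanded) != len(measured):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(math.dist(c, m) for c, m in zip(commanded, measured))
    return total / len(commanded)


def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error
```

Under this definition, a reduction of 51.9% would mean the new policy's error is a little under half of the reference policy's error on the same terrain.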