VIRAL: Visual Sim-to-Real at Scale for Humanoid Loco-Manipulation

Tairan He, Zi Wang, Haoru Xue, Qingwei Ben, Zhengyi Luo, Wenli Xiao, Ye Yuan, Xingye Da, Fernando Castañeda, Shankar Sastry, Changliu Liu, Guanya Shi, Linxi Fan, Yuke Zhu

TL;DR

VIRAL presents a scalable, RGB-driven sim-to-real framework for humanoid loco-manipulation: a privileged-state teacher is trained in simulation and its behavior is distilled into a vision-based student. The approach combines a delta action space, reference state initialization, and a mixed DAgger-BC training regime, supported by large-scale GPU simulation, extensive visual domain randomization, and real-to-sim alignment. Real-world experiments on a Unitree G1 show performance approaching expert teleoperation and generalization across spatial and appearance variations, with ablations identifying the critical design choices. The work offers practical guidelines for scaling visual sim-to-real, highlighting the importance of compute, randomization, and system alignment for autonomous humanoid loco-manipulation without real-world fine-tuning.

Abstract

A key barrier to the real-world deployment of humanoid robots is the lack of autonomous loco-manipulation skills. We introduce VIRAL, a visual sim-to-real framework that learns humanoid loco-manipulation entirely in simulation and deploys it zero-shot to real hardware. VIRAL follows a teacher-student design: a privileged RL teacher, operating on full state, learns long-horizon loco-manipulation using a delta action space and reference state initialization. A vision-based student policy is then distilled from the teacher via large-scale simulation with tiled rendering, trained with a mixture of online DAgger and behavior cloning. We find that compute scale is critical: scaling simulation to tens of GPUs (up to 64) makes both teacher and student training reliable, while low-compute regimes often fail. To bridge the sim-to-real gap, VIRAL combines large-scale visual domain randomization (over lighting, materials, camera parameters, image quality, and sensor delays) with real-to-sim alignment of the dexterous hands and cameras. Deployed on a Unitree G1 humanoid, the resulting RGB-based policy performs continuous loco-manipulation for up to 54 cycles, generalizing to diverse spatial and appearance variations without any real-world fine-tuning, and approaching expert-level teleoperation performance. Extensive ablations dissect the key design choices required to make RGB-based humanoid loco-manipulation work in practice.
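
To make the "mixture of online DAgger and behavior cloning" concrete, the toy sketch below shows one way such a mixture could be organized. It is not the paper's implementation. The assumptions are ours: the mix is chosen per rollout with a hypothetical probability `dagger_prob`, the state is one-dimensional, the teacher is a hypothetical linear controller (`TEACHER_GAIN`), and the student is a single gain fitted by least squares. In DAgger rollouts the student drives the system and the teacher only supplies labels; in BC rollouts the teacher both drives and labels. Actions are deltas added to the current state, loosely mirroring a delta action space.

```python
import random

TEACHER_GAIN = -0.5  # hypothetical privileged teacher: delta = gain * state


def teacher_delta(state):
    """Teacher label: a delta action that moves the state toward zero."""
    return TEACHER_GAIN * state


def collect_rollout(student_gain, horizon, dagger_prob, rng):
    """Collect (state, teacher_label) pairs from one rollout.

    With probability dagger_prob the student drives (DAgger-style);
    otherwise the teacher drives (BC-style). Labels always come from
    the teacher.
    """
    student_drives = rng.random() < dagger_prob
    state = rng.uniform(-1.0, 1.0)
    samples = []
    for _ in range(horizon):
        label = teacher_delta(state)
        samples.append((state, label))
        delta = student_gain * state if student_drives else label
        state += delta  # delta action: an increment, not an absolute target
    return student_drives, samples


def fit_gain(samples):
    """Least-squares fit of a scalar gain k minimizing sum((k*s - a)^2)."""
    num = sum(s * a for s, a in samples)
    den = sum(s * s for s, _ in samples)
    return num / den if den > 0 else 0.0


def train(iterations=5, rollouts=20, horizon=10, dagger_prob=0.5, seed=0):
    """Alternate mixed data collection and supervised fitting."""
    rng = random.Random(seed)
    student_gain = 0.0  # untrained student outputs zero deltas
    dataset = []
    for _ in range(iterations):
        for _ in range(rollouts):
            _, samples = collect_rollout(student_gain, horizon, dagger_prob, rng)
            dataset.extend(samples)  # aggregate data across iterations
        student_gain = fit_gain(dataset)
    return student_gain


if __name__ == "__main__":
    print("learned student gain:", train())
```

In this toy setting the student recovers the teacher's gain because the teacher is exactly linear. The point of the sketch is only the data flow: student-driven rollouts expose the learner to its own state distribution, while teacher-driven rollouts supply clean demonstrations.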

Paper Structure

This paper contains 33 sections, 2 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: VIRAL teacher-student pipeline. Phase 1: In simulation, a privileged RL teacher policy $\pi_{\text{teacher}}$ receives full-state proprioception and exteroception of the task information and outputs WBC commands. Phase 2: A vision-based student policy $\pi_{\text{student}}$ observes only RGB images and sim-to-real proprioception and is trained to imitate the teacher policy via DAgger and behavior cloning.
  • Figure 2: Visual randomization of image quality, lighting, materials, and camera extrinsics for sim-to-real robustness.
  • Figure 3: Frames of reference state initialization for teacher RL.
  • Figure 4: System identification of the dexterous hand. Real–sim overlays (top) and joint position trajectories (bottom) before and after SysID, showing markedly improved alignment.
  • Figure 5: Real-to-sim camera extrinsics alignment. Real view versus simulated views before and after alignment.
  • ...and 10 more figures