DreamControl: Human-Inspired Whole-Body Humanoid Control for Scene Interaction via Guided Diffusion

Dvij Kalaria, Sudarshan S Harithas, Pushkal Katara, Sangkyung Kwak, Sarthak Bhagat, Shankar Sastry, Srinath Sridhar, Sai Vemprala, Ashish Kapoor, Jonathan Chung-Kuan Huang

TL;DR

DreamControl addresses the challenge of learning autonomous whole-body humanoid skills by fusing a diffusion-based human motion prior with reinforcement learning. A diffusion prior guided by text and spatiotemporal cues generates reference trajectories, which are then retargeted to a Unitree G1 and used to train a goal-conditioned RL policy to execute tasks in simulation and transfer to real hardware. The approach yields more natural, stable, and task-consistent motions than baselines, with strong sim2real performance and broad task coverage. By reducing reliance on teleoperation data and leveraging abundant human motion data, DreamControl offers a data-efficient route to scalable, scene-interacting humanoid control across diverse morphologies and tasks.

Abstract

We introduce DreamControl, a novel methodology for learning autonomous whole-body humanoid skills. DreamControl leverages the strengths of diffusion models and Reinforcement Learning (RL): our core innovation is the use of a diffusion prior trained on human motion data, which subsequently guides an RL policy in simulation to complete specific tasks of interest (e.g., opening a drawer or picking up an object). We demonstrate that this human motion-informed prior allows RL to discover solutions unattainable by direct RL, and that diffusion models inherently promote natural-looking motions, aiding in sim-to-real transfer. We validate DreamControl's effectiveness on a Unitree G1 robot across a diverse set of challenging tasks involving simultaneous lower- and upper-body control and object interaction. Project website: https://genrobo.github.io/DreamControl/

Paper Structure

This paper contains 49 sections, 24 equations, 3 figures, 14 tables.

Figures (3)

  • Figure 1: Unitree G1 humanoid performing various skills trained via DreamControl, including (1) opening a drawer, (2) bimanual pick (of a box), (3) ordinary pick and (4) pressing an elevator button.
  • Figure 2: DreamControl Overview: (A) we first generate text and spatiotemporally guided human motion trajectories using diffusion; (B) we train goal-conditioned RL policies to track these generated trajectories while completing some task of interest; (C) we deploy these policies to a real humanoid, leveraging off-the-shelf vision models to determine spatial guidance inputs for the RL policy.
  • Figure 3: Comparison of trajectories for the Jump task. The top row shows results from the TaskOnly+ baseline, while the bottom row shows trajectories from DreamControl. The yellow sphere depicts the spatial control point used to guide the trajectories.
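
Stage (B) of Figure 2 describes policies that track generated reference trajectories while completing a task. One common way to express such an objective is a weighted sum of a tracking term and a task term. The sketch below assumes that form purely for illustration: the function names, the exponential kernel, and the weights are hypothetical and are not taken from the paper.

```python
import math


def tracking_reward(robot_pose, reference_pose, sigma=0.5):
    """Score in (0, 1]: 1.0 when the robot pose matches the reference exactly.

    Hypothetical kernel: exp(-squared_error / sigma^2). The kernel and the
    value of sigma are assumptions, not the paper's formulation.
    """
    if len(robot_pose) != len(reference_pose):
        raise ValueError("pose vectors must have the same length")
    squared_error = sum(
        (q - q_ref) ** 2 for q, q_ref in zip(robot_pose, reference_pose)
    )
    return math.exp(-squared_error / sigma**2)


def combined_reward(robot_pose, reference_pose, task_success,
                    w_track=1.0, w_task=5.0):
    """Weighted sum of a tracking term and a sparse task-success bonus.

    The weights are illustrative placeholders, not tuned values.
    """
    task_term = 1.0 if task_success else 0.0
    return w_track * tracking_reward(robot_pose, reference_pose) + w_task * task_term


if __name__ == "__main__":
    reference = [0.0, 0.3, -0.2]   # one frame of a made-up reference trajectory
    close = [0.0, 0.25, -0.2]      # robot pose near the reference
    far = [0.5, -0.4, 0.6]         # robot pose far from the reference
    print(combined_reward(close, reference, task_success=False))
    print(combined_reward(far, reference, task_success=False))
    print(combined_reward(close, reference, task_success=True))
```

In a structure like this, the tracking term keeps the motion close to the human-like reference, while the task term rewards actually completing the interaction. How the two are balanced is a design choice that this sketch does not attempt to resolve.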