VisualMimic: Visual Humanoid Loco-Manipulation via Motion Tracking and Generation

Shaofeng Yin, Yanjie Ze, Hong-Xing Yu, C. Karen Liu, Jiajun Wu

TL;DR

VisualMimic tackles humanoid loco-manipulation in unstructured environments by combining egocentric vision with whole-body control in a visual sim-to-real framework. It uses a two-level hierarchy: a general low-level keypoint tracker, learned from human motion via a teacher-student pipeline, and a task-specific high-level keypoint generator, trained in simulation with privileged object states and then distilled into a vision-based policy. To stabilize training, it injects noise into the low-level policy and clips high-level actions to the human motion space. The resulting policies transfer zero-shot to real hardware and perform lifting, pushing, dribbling, and kicking tasks, including in outdoor environments.

Abstract

Humanoid loco-manipulation in unstructured environments demands tight integration of egocentric perception and whole-body control. However, existing approaches either depend on external motion capture systems or fail to generalize across diverse tasks. We introduce VisualMimic, a visual sim-to-real framework that unifies egocentric vision with hierarchical whole-body control for humanoid robots. VisualMimic combines a task-agnostic low-level keypoint tracker -- trained from human motion data via a teacher-student scheme -- with a task-specific high-level policy that generates keypoint commands from visual and proprioceptive input. To ensure stable training, we inject noise into the low-level policy and clip high-level actions using human motion statistics. VisualMimic enables zero-shot transfer of visuomotor policies trained in simulation to real humanoid robots, accomplishing a wide range of loco-manipulation tasks such as box lifting, pushing, football dribbling, and kicking. Beyond controlled laboratory settings, our policies also generalize robustly to outdoor environments. Videos are available at: https://visualmimic.github.io .
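
The two stabilization steps named in the abstract, clipping high-level actions with human motion statistics and injecting noise at the low level, can be illustrated with a minimal sketch. The abstract does not say which statistics are used or how the noise is distributed, so everything specific here is an assumption for illustration: the mean ± k·std bounds, the Gaussian noise, the 15-dimensional keypoint command, and the function names `motion_bounds`, `clip_high_level_action`, and `noisy_command`.

```python
import numpy as np


def motion_bounds(human_keypoints, k=3.0):
    """Per-dimension bounds from human keypoint data of shape (N, D).

    Assumption: bounds are mean +/- k * std; the source only says that
    statistics computed from human motions are used for clipping.
    """
    mean = human_keypoints.mean(axis=0)
    std = human_keypoints.std(axis=0)
    return mean - k * std, mean + k * std


def clip_high_level_action(action, lower, upper):
    """Keep a high-level keypoint command inside the human motion range."""
    return np.clip(action, lower, upper)


def noisy_command(command, rng, sigma=0.05):
    """Perturb a command with Gaussian noise (hypothetical noise model)."""
    return command + rng.normal(0.0, sigma, size=command.shape)


rng = np.random.default_rng(0)

# Synthetic stand-in for keypoint commands extracted from human motion data.
human_keypoints = rng.normal(0.0, 0.2, size=(1000, 15))
lower, upper = motion_bounds(human_keypoints)

# A deliberately out-of-range high-level action, as an untrained policy might emit.
raw_action = rng.normal(0.0, 2.0, size=15)
clipped = clip_high_level_action(raw_action, lower, upper)

# Noise applied to the command the low-level tracker receives.
perturbed = noisy_command(clipped, rng)
```

Under this reading, clipping keeps the high-level policy from commanding keypoints the tracker never saw during training on human motion, while the noise exposes the tracker to imperfect commands of the kind a learned generator will produce.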

Paper Structure

This paper contains 23 sections, 3 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: VisualMimic consists of two training stages: (1) training a general keypoint tracker, where a teacher motion tracker is first trained and then distilled into a keypoint tracker that follows keypoint commands; and (2) training a task-specific keypoint generator, where a teacher policy with privileged object states is first trained and then distilled into a visuomotor policy. To ensure stable learning, we compute statistics from human motions and use them to clip high-level actions. Here, $o_t$ is the proprioceptive observation at time $t$, $a_t$ is the action, and $s_{\text{obj}}$ represents the object state.
  • Figure 2: Our visuomotor policies generalize across diverse locations and times, shown on the box-pushing task.
  • Figure 3: Real-world deployment of visuomotor policies on a humanoid, showcasing diverse loco-manipulation tasks: Lift Box, Kick Ball, and Kick Box.
  • Figure 4: Visuomotor policies perform diverse loco-manipulation tasks in simulation: from left to right, Balance Ball, Push Cube, Reach Box, Large Kick.
  • Figure 5: Box-kicking behaviors. With our teacher–student training (ours, top), the humanoid can mimic human-like motion, while training without it leads to non-human-like motion (bottom).
  • ...and 3 more figures
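
The teacher-to-student distillation described in the Figure 1 caption can be sketched as supervised regression: a student without access to privileged object states is fit to reproduce the actions of a teacher that has them. This is a toy illustration only. The linear policies, the dimensions, the noisy stand-in for visual features, and the least-squares fit are all assumptions; the caption does not specify the network architecture, the loss, or how training data is collected.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: proprioception o_t, privileged object state s_obj, action a_t.
OBS_DIM, PRIV_DIM, ACT_DIM = 8, 3, 4

# A fixed linear map stands in for an already-trained teacher policy.
W_teacher = rng.normal(size=(OBS_DIM + PRIV_DIM, ACT_DIM))


def teacher_policy(obs, priv):
    """Teacher acts on proprioception plus the privileged object state."""
    return np.concatenate([obs, priv], axis=-1) @ W_teacher


def distill(student_inputs, teacher_actions):
    """Fit a linear student to the teacher's actions by least squares."""
    weights, *_ = np.linalg.lstsq(student_inputs, teacher_actions, rcond=None)
    return weights


obs = rng.normal(size=(2000, OBS_DIM))
priv = rng.normal(size=(2000, PRIV_DIM))

# Stand-in for visual features: a noisy view of the object state,
# since the student never observes s_obj directly.
vis = priv + 0.1 * rng.normal(size=priv.shape)

targets = teacher_policy(obs, priv)
student_in = np.concatenate([obs, vis], axis=-1)
W_student = distill(student_in, targets)

# Imitation error of the student, compared with always predicting zero.
mse = float(np.mean((student_in @ W_student - targets) ** 2))
baseline = float(np.mean(targets ** 2))
```

The same pattern applies to both stages in the caption: in stage (1) the student replaces full motion tracking inputs with keypoint commands, and in stage (2) it replaces privileged object states with visual input.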