Hierarchical visuomotor control of humanoids

Josh Merel, Arun Ahuja, Vu Pham, Saran Tunyasuvunakool, Siqi Liu, Dhruva Tirumala, Nicolas Heess, Greg Wayne

TL;DR

This work tackles integrated visuomotor control for high-DoF humanoids by proposing a hierarchical architecture in which a memory- and vision-based high-level controller coordinates multiple low-level motor controllers pretrained from motion-capture data. It systematically compares steerable, switching, and cold-switching control paradigms for low-level skills, showing that discrete selection among control fragments provides the strongest performance on vision-based locomotion and memory tasks. Key findings include successful Go-to-target, Walls, Gaps, Forage, and Heterogeneous Forage tasks in MuJoCo, with memory-dependent behavior emerging in Heterogeneous Forage (where ball rewards are reassigned each episode and must be remembered), and analyses indicating that the high-level policy leverages perceptual cues like ball borders and walls. The approach scales motor skill reuse without heavy hand-engineering, marking a step toward flexible, vision-guided, memory-augmented humanoids, while highlighting ongoing challenges such as automation of transitions and reduction of movement artifacts.

Abstract

We aim to build complex humanoid agents that integrate perception, motor control, and memory. In this work, we partly factor this problem into low-level motor control from proprioception and high-level coordination of the low-level skills informed by vision. We develop an architecture capable of surprisingly flexible, task-directed motor control of a relatively high-DoF humanoid body by combining pre-training of low-level motor controllers with a high-level, task-focused controller that switches among low-level sub-policies. The resulting system is able to control a physically-simulated humanoid body to solve tasks that require coupling visual perception from an unstabilized egocentric RGB camera during locomotion in the environment. A supplementary video is available at https://youtu.be/7GISvfbykLE.
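
The division of labor described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy policies, the toy dynamics, the random "vision" input, and all names (`make_toy_fragment`, `toy_high_level`, `run_episode`) are hypothetical. It only shows the control flow, assuming a fixed switching interval `k` as stated in the Figure 3 caption: the high-level controller sees vision and picks a low-level policy, and the selected low-level policy sees only proprioception.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical type: a low-level policy maps proprioception to joint actions.
LowLevelPolicy = Callable[[Sequence[float]], List[float]]


def make_toy_fragment(gain: float) -> LowLevelPolicy:
    """Toy stand-in for a pretrained low-level controller (assumption)."""
    def policy(proprio: Sequence[float]) -> List[float]:
        return [gain * x for x in proprio]
    return policy


def toy_high_level(vision: Sequence[float], num_fragments: int) -> int:
    """Toy stand-in for the vision-based high-level controller (assumption).

    Maps the mean pixel intensity, assumed to lie in [0, 1], to an index.
    """
    mean = sum(vision) / len(vision)
    return min(int(mean * num_fragments), num_fragments - 1)


def run_episode(fragments: List[LowLevelPolicy], k: int,
                num_steps: int, seed: int = 0) -> List[int]:
    """Run the two-level loop and return the sequence of selected indices."""
    rng = random.Random(seed)
    proprio = [0.1, -0.2, 0.3]  # toy proprioceptive state
    active = 0
    selections = []
    for t in range(num_steps):
        if t % k == 0:
            # Only the high-level controller receives the (fake) image.
            vision = [rng.random() for _ in range(16)]
            active = toy_high_level(vision, len(fragments))
            selections.append(active)
        # The active low-level policy acts from proprioception alone.
        action = fragments[active](proprio)
        # Toy dynamics standing in for the physics simulator.
        proprio = [0.9 * p + 0.1 * a for p, a in zip(proprio, action)]
    return selections


if __name__ == "__main__":
    fragments = [make_toy_fragment(g) for g in (0.5, 1.0, 1.5)]
    print(run_episode(fragments, k=4, num_steps=10))
```

With `k=4` and `num_steps=10`, the high-level controller is queried at steps 0, 4, and 8, so three selections are made while the low-level policies run at every step.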

Paper Structure

This paper contains 27 sections, 6 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Training settings for explicit training of transition-capable controllers. Panel A depicts a cartoon of a training episode for a steerable controller in which the turning radius of each gait cycle is selected randomly. Panel B depicts training a policy under an explicit, hand-designed transition graph for $k$ options.
  • Figure 2: Cold-switching among a set of behaviors (A) occurs only at the end of clips, forming a trajectory composed of the sequential activation of the policies (B). Alternatively, policies are fragmented at a pre-specified set of times, cutting each policy into sub-policies (C), which serve as control fragments, enabling sequencing at a higher frequency (D).
  • Figure 3: Schematic of the architecture: a high-level controller (HL) selects among multiple low-level (LL) control fragments, which are policies with proprioception. Switching from one control fragment to another occurs every $k$ time steps.
  • Figure 4: A. Go-to-target: in this task, the agent moves on an open plane to a target provided in egocentric coordinates. B. Walls: The agent runs forward while avoiding solid walls using vision. C. Gaps: The agent runs forward and must jump between platforms to advance. D. Forage: Using vision, the agent roams in a procedurally-generated maze to collect balls, which provide sparse rewards. E. Heterogeneous Forage: The agent must probe and remember rewards that are randomly assigned to the balls in each episode.
  • Figure 5: Performance of various approaches on each core task. Of the approaches we compared, discrete switching among control fragments performed the best. Plots show the mean and standard error over multiple runs.
  • ...and 6 more figures
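
The fragmentation idea in the Figure 2 caption (panels C and D) can be sketched as simple bookkeeping. The caption says only that policies are cut at a pre-specified set of times; the uniform window length, the clip names, and the functions `fragment_clip` and `build_fragment_table` below are assumptions made for illustration, not details from the paper.

```python
from typing import Dict, List, Tuple


def fragment_clip(clip_len: int, fragment_len: int) -> List[Tuple[int, int]]:
    """Split time steps [0, clip_len) into consecutive [start, end) windows.

    Uniform windows are an assumption; the last window may be shorter.
    """
    if clip_len <= 0 or fragment_len <= 0:
        raise ValueError("clip_len and fragment_len must be positive")
    return [(start, min(start + fragment_len, clip_len))
            for start in range(0, clip_len, fragment_len)]


def build_fragment_table(clip_lengths: Dict[str, int],
                         fragment_len: int) -> List[Tuple[str, int, int]]:
    """List every (clip name, start, end) control fragment.

    A high-level controller could then select by index into this list.
    """
    table = []
    for name, length in clip_lengths.items():
        for start, end in fragment_clip(length, fragment_len):
            table.append((name, start, end))
    return table


if __name__ == "__main__":
    # Hypothetical clip names and lengths, in time steps.
    clips = {"walk": 10, "turn_left": 8}
    for entry in build_fragment_table(clips, fragment_len=4):
        print(entry)
```

Under cold-switching, a new behavior could be chosen only once per clip; with the table above, a choice is available at every fragment boundary, which is the higher sequencing frequency the caption refers to.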