Chasing Autonomy: Dynamic Retargeting and Control Guided RL for Performant and Controllable Humanoid Running

Zachary Olkin, William D. Compton, Ryan M. Bena, Aaron D. Ames

Abstract

Humanoid robots hold the promise of locomoting like humans, including fast, dynamic running. Reinforcement learning (RL) controllers that mimic human motions have recently become popular because they can generate highly dynamic behaviors, but they are often restricted to single-motion playback, which hinders their deployment in long-duration, autonomous locomotion. In this paper, we present a pipeline that dynamically retargets human motions through an optimization routine with hard constraints, generating improved periodic reference libraries from a single human demonstration. We then study the effect of both the reference motion and the reward structure on reference tracking and commanded-velocity tracking, concluding that a goal-conditioned, control-guided reward that tracks dynamically optimized human data yields the best performance. We deploy the policy on hardware, demonstrating its speed and endurance by achieving running speeds of up to 3.3 m/s on a Unitree G1 robot and traversing hundreds of meters in real-world environments. Additionally, to demonstrate the controllability of the locomotion, we use the controller in a full perception and planning autonomy stack for obstacle avoidance while running outdoors.
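
The paper's exact control-guided (CLF) reward is defined in the body of the text; as a rough illustration only, a Lyapunov-decrease-style shaping term might look like the Python sketch below. The quadratic energy V, the gains gamma and scale, and the name clf_reward are all illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def clf_reward(x, x_next, x_ref, x_ref_next, gamma=0.1, scale=5.0):
    """Hypothetical control-guided reward term (NOT the paper's exact form).

    Rewards satisfaction of the discrete CLF decrease condition
    V(x_next) <= (1 - gamma) * V(x), with V a quadratic tracking energy.
    """
    V = lambda e: float(np.dot(e, e))            # assumed quadratic energy
    v_now = V(np.asarray(x) - np.asarray(x_ref))
    v_next = V(np.asarray(x_next) - np.asarray(x_ref_next))
    violation = max(0.0, v_next - (1.0 - gamma) * v_now)
    return np.exp(-scale * violation)            # 1 when the decrease condition holds
```

Under these assumptions, the reward is 1 whenever the tracking energy decays at least at rate gamma and falls off exponentially with the size of the violation, which steers the policy toward the reference without hard-coding a single motion.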

Figures (6)

  • Figure 1: Demonstration of the running controller indoors on a constrained treadmill and in outdoor real-world environments. The running appears human-like while still achieving the commanded speed through a combination of optimized retargeted human data and control-guided reward shaping.
  • Figure 2: Architecture of the proposed pipeline. Human data is retargeted using state-constrained dynamic optimization with a hybrid system model, which allows us to build a library of references from a single human reference and then track it with the RL controller. The RL controller uses a control-guided cost function. The policy is then transferred zero-shot to hardware and can be used in an autonomy stack.
  • Figure 3: Visual depiction of the multiple-shooting trajectory optimization problem. Two hybrid domains are shown, and the node where they meet applies the associated reset map. Within each domain, a number of optimization nodes are used. The cost tracks the human reference motion (see the multiple-shooting sketch after this list).
  • Figure 4: Reference-tracking comparison across reference and reward types. Error bars show one std. dev. The dynamically optimized human data performs best, and the CLF rewards outperform the Mimic rewards. Optimized retargeted references outperform kinematically retargeted human data.
  • Figure 5: Outdoor long-range experiments. The robot runs outdoors over distances of more than 150 m. The robot was given yaw targets in the global frame, a target position to track with lateral body motion, and a feedforward velocity in the body-forward direction. Lidar odometry and a P controller were used for global pose tracking (a minimal sketch of such a controller follows this list).
  • ...and 1 more figure
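
To make the structure described in Figure 3 concrete, below is a minimal multiple-shooting sketch in Python with CasADi. The toy double-integrator dynamics, the reset map, the fixed switching node, and all dimensions and bounds are placeholder assumptions, not the paper's state-constrained retargeting problem; only the overall structure mirrors the figure: shooting continuity constraints within each hybrid domain, a reset map at the domain boundary, hard constraints, and a reference-tracking cost.

```python
# Minimal multiple-shooting sketch with one hybrid reset (toy model; the
# paper's full state-constrained retargeting problem is far richer).
import casadi as ca

nx, nu, N, dt = 4, 2, 20, 0.02           # toy dimensions / horizon / step
opti = ca.Opti()

X = opti.variable(nx, N + 1)             # shooting nodes (states)
U = opti.variable(nu, N)                 # controls between nodes
x_ref = ca.DM.zeros(nx, N + 1)           # human reference to track (placeholder)

def f(x, u):
    # Toy double-integrator step: positions 0-1 driven by velocities 2-3.
    return x + dt * ca.vertcat(x[2], x[3], u[0], u[1])

def reset_map(x):
    # Placeholder impact/reset map applied at the domain switch.
    return ca.vertcat(x[0], x[1], -0.8 * x[2], -0.8 * x[3])

k_switch = N // 2                        # assumed fixed switching node
cost = 0
for k in range(N):
    cost += ca.sumsqr(X[:, k] - x_ref[:, k])          # track the reference
    x_next = f(X[:, k], U[:, k])
    if k == k_switch:
        x_next = reset_map(x_next)                    # reset at the boundary
    opti.subject_to(X[:, k + 1] == x_next)            # shooting continuity
    opti.subject_to(opti.bounded(-50, U[:, k], 50))   # hard input bounds

opti.subject_to(X[:, 0] == ca.DM([0.0, 0.9, 2.5, 0.0]))  # assumed initial state
opti.minimize(cost)
opti.solver("ipopt")
sol = opti.solve()
```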
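Figure 5 mentions lidar odometry feeding a P controller for global pose tracking. A minimal sketch of that kind of outer loop is shown below; the gains, frame conventions, and function name are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical outer-loop P controller mapping a lidar-odometry pose
# estimate to velocity commands for the running policy.
import numpy as np

def pose_tracking_command(pose, target_pos, target_yaw, v_fwd_ff,
                          kp_yaw=1.5, kp_lat=0.8):
    """pose: (x, y, yaw) in the global frame; target_pos: (x, y) goal point.

    Returns (v_x, v_y, yaw_rate): feedforward forward speed, P-feedback
    lateral speed, and P-feedback yaw rate.
    """
    x, y, yaw = pose
    # Yaw error wrapped to [-pi, pi].
    yaw_err = np.arctan2(np.sin(target_yaw - yaw), np.cos(target_yaw - yaw))
    # Position error projected onto the body lateral (y) axis.
    ex, ey = target_pos[0] - x, target_pos[1] - y
    lat_err = -np.sin(yaw) * ex + np.cos(yaw) * ey
    return v_fwd_ff, kp_lat * lat_err, kp_yaw * yaw_err
```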