Revisiting Reward Design and Evaluation for Robust Humanoid Standing and Walking

Bart van Marum; Aayam Shrestha; Helei Duan; Pranay Dugar; Jeremy Dao; Alan Fern

TL;DR

This work addresses the need for repeatable real-world evaluation of humanoid standing and walking (SaW) controllers trained via sim-to-real reinforcement learning. It introduces a low-cost benchmarking framework covering disturbance rejection, command following, and energy efficiency, and demonstrates how these metrics reveal hidden weaknesses in reward designs. The authors propose a minimally constraining SaW reward and evaluate the resulting controller against a manufacturer controller and a clock-based RL controller on the Digit robot. The comparison exposes clear trade-offs and guides targeted improvements, yielding a more robust SaW controller (Single Contact++ RL). The findings underscore the importance of systematic, real-world benchmarking for advancing reliable humanoid locomotion and point to future work on energy efficiency and smoother transitions between standing and walking.

Abstract

A necessary capability for humanoid robots is the ability to stand and walk while rejecting natural disturbances. Recent progress has been made using sim-to-real reinforcement learning (RL) to train such locomotion controllers, with approaches differing mainly in their reward functions. However, prior works lack a clear method to systematically test new reward functions and compare controller performance through repeatable experiments. This limits our understanding of the trade-offs between approaches and hinders progress. To address this, we propose a low-cost, quantitative benchmarking method to evaluate and compare the real-world performance of standing and walking (SaW) controllers on metrics like command following, disturbance recovery, and energy efficiency. We also revisit reward function design and construct a minimally constraining reward function to train SaW controllers. We experimentally verify that our benchmarking framework can identify areas for improvement, which can be systematically addressed to enhance the policies. We also compare our new controller to state-of-the-art controllers on the Digit humanoid robot. The results provide clear quantitative trade-offs among the controllers and suggest directions for future improvements to the reward functions and expansion of the benchmarks.

Paper Structure (15 sections, 5 figures, 1 table)

Introduction
Problem Statement and Related Work
Quantitative SaW Performance Benchmark
Disturbance Rejection
Command Following
Energy Efficiency
SaW Training and Reward Design
Architecture and Training Framework
Reward Design
Evaluation Results
Disturbance Rejection While Standing
Benchmark Guided Improvement
Command Following
Energy Efficiency
Summary

Figures (5)

Figure 1: We propose a set of metrics with an easy-to-set-up testing fixture and provide quantitative results on controller performance in the real world. Our proposed RL-based method produces a robust standing-and-walking controller for the humanoid robot Digit. The learned controller can handle a range of significant disturbances, such as a lateral push of 150 N for 500 ms (shown in A) and a sagittal push of 200 N for 500 ms (shown in B). The controller is able to walk, stand, and seamlessly transition between these two modes.
Figure 2: An impulse is applied to the robot by means of a weight connected by a rope. Force $\boldsymbol{F}$ is regulated by adding and removing weight. Duration $\Delta t$ is regulated by a microcontroller that automatically disconnects the weight from the rope after a set amount of time. The rope is always attached to Digit at the same height of 122 cm.
Figure 3: Disturbance rejection success rates for various humanoid SaW controllers in the $x$-direction (left) and $y$-direction (right). Results show that our Single Contact++ reward function outperforms competing alternatives. Our Single Contact controller shows asymmetric and non-monotonic results in the $y$-direction, emphasizing the importance of systematic evaluation.
Figure 4: Command following accuracy for turning in place. Error bars show the standard deviation. Note that the 30-second drift results for the Agility Controller were in some cases helped by the robot tether. It is safe to assume that results without the tether would have been closer to the upper end of the error bar.
Figure 5: Approximation of power consumption for a commanded run of 1 m/s for 10 seconds. Policies start in standing mode, and end in standing mode. * Note that results for Single Contact++ RL Controller are missing due to an experiment damaging the robot close to submission.
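
As a rough illustration of the push fixture described in Figure 2, the sketch below estimates the hanging mass needed for a target force and the resulting impulse $F \Delta t$. It assumes an idealized setup that the source does not spell out: the rope tension equals the weight of the hanging mass ($F = m g$), with no pulley friction, rope stretch, or acceleration of the weight. The function names and the value of `G` are illustrative choices, not part of the authors' tooling.

```python
# Minimal sketch of the push arithmetic, under idealized assumptions:
# rope tension equals the weight of the hanging mass (F = m * g), and the
# force is constant over the release window. Names here are hypothetical.

G = 9.81  # assumed gravitational acceleration in m/s^2


def required_mass_kg(force_n: float) -> float:
    """Mass whose weight produces the target force (idealized, F = m * g)."""
    return force_n / G


def impulse_ns(force_n: float, duration_s: float) -> float:
    """Impulse of a constant force applied for a fixed duration (F * dt)."""
    return force_n * duration_s


# Push magnitudes and durations quoted in the Figure 1 caption.
pushes = {
    "lateral (Fig. 1A)": (150.0, 0.5),
    "sagittal (Fig. 1B)": (200.0, 0.5),
}

for name, (force, duration) in pushes.items():
    mass = required_mass_kg(force)
    impulse = impulse_ns(force, duration)
    print(f"{name}: mass = {mass:.1f} kg, impulse = {impulse:.0f} N*s")
```

Under these assumptions, the 150 N lateral push corresponds to roughly 15.3 kg of hanging weight and an impulse of 75 N·s, and the 200 N sagittal push to roughly 20.4 kg and 100 N·s. A real fixture would likely deviate from these figures because of friction and the dynamics of the falling weight.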
