MuJoCo MPC for Humanoid Control: Evaluation on HumanoidBench

Moritz Meser, Aditya Bhatt, Boris Belousov, Jan Peters

TL;DR

The paper tackles sparse reward issues in HumanoidBench by applying MuJoCo MPC (MJPC) with a shaped reward framework. It transforms the HumanoidBench reward into a cost $c_{\text{hb}}(x,u)=|r_{\max}-r_{\text{hb}}|$ with $r_{\max}=1$ and augments the objective with a finite-horizon cost $c(x,u)=\sum_i w_i \cdot n_i(c_i(x,u))$, supplemented by seven stability terms and three dense residuals. This approach yields higher HumanoidBench scores while maintaining realistic postures and smoother control signals, and the authors advocate longer, repeated episodes for robust evaluation. The contributions include the shaped reward design, an extended evaluation protocol, planner analysis, and public release of code for MJPC-based humanoid control.
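The reward-to-cost transform and the weighted objective described above can be sketched in a few lines. This is only an illustration under stated assumptions: each term $c_i$ is treated as a scalar, the function names `hb_cost` and `shaped_cost` are hypothetical, and the example weights and norm functions are made up for the demonstration rather than taken from the paper or from the MJPC code.

```python
from typing import Callable, Sequence

R_MAX = 1.0  # maximum per-step HumanoidBench reward, as stated in the text


def hb_cost(r_hb: float, r_max: float = R_MAX) -> float:
    """Convert a HumanoidBench reward into the cost |r_max - r_hb|."""
    return abs(r_max - r_hb)


def shaped_cost(
    terms: Sequence[float],
    weights: Sequence[float],
    norms: Sequence[Callable[[float], float]],
) -> float:
    """Evaluate c = sum_i w_i * n_i(c_i) for scalar terms c_i."""
    if not (len(terms) == len(weights) == len(norms)):
        raise ValueError("terms, weights and norms must have equal length")
    return sum(w * n(c) for c, w, n in zip(terms, weights, norms))


def quadratic(c: float) -> float:
    """Hypothetical norm choice: 0.5 * c^2."""
    return 0.5 * c * c


# Illustrative values only: one task term derived from the benchmark
# reward and one made-up regularization term (e.g. a posture deviation).
task_term = hb_cost(0.25)  # reward 0.25 gives cost 0.75
total = shaped_cost(
    terms=[task_term, 2.0],
    weights=[1.0, 0.5],
    norms=[abs, quadratic],
)
print(total)
```

A reward at its maximum maps to zero cost, so the task term vanishes exactly when the benchmark objective is fully satisfied, and the remaining terms then act purely as regularizers.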

Abstract

We tackle the recently introduced benchmark for whole-body humanoid control HumanoidBench using MuJoCo MPC. We find that sparse reward functions of HumanoidBench yield undesirable and unrealistic behaviors when optimized; therefore, we propose a set of regularization terms that stabilize the robot behavior across tasks. Current evaluations on a subset of tasks demonstrate that our proposed reward function allows achieving the highest HumanoidBench scores while maintaining realistic posture and smooth control signals. Our code is publicly available and will become a part of MuJoCo MPC, enabling rapid prototyping of robot behaviors.

Paper Structure (5 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Performance comparison between the proposed controller MPC-ours, which leverages shaped reward functions, and the baselines: MPC-hb, which uses the HumanoidBench reward function, and the RL baselines. The $y$-axis shows the HumanoidBench score, given by the sum of rewards over a trajectory, with a maximum of $1000$ for each task. For MPC methods, we employ the iLQG planner on the Stand and Walk tasks, and the Sampling planner on the Push task, which involves many contact interactions. Results from $6$ runs of each MPC method are reported. RL baseline results are imported from [howell2022predictive], where $3$ runs of each method are reported; we take the best policy from each run. MPC-ours significantly outperforms the baselines across tasks.
  • Figure 2: Behavior comparison between MPC-ours (top row) and MPC-hb (bottom row) on the Push task. Unlike our shaped reward, which encourages posture maintenance and balance, the HumanoidBench reward puts all emphasis on reaching the target box location as fast as possible, driving the robot into unrecoverable postures and thereby precluding further tasks.
  • Figure 3: Influence of the episode length on evaluation scores, shown on the Walk task over $6$ runs. An episode would normally stop at the $2$ s mark (vertical dotted line), yielding a relatively high cumulative reward for MPC-hb (green curve) thanks to good initial performance. However, when the episode is extended further, the median HumanoidBench reward of MPC-hb drops almost to zero, while that of MPC-ours stays at the maximum value of $1$.
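
The scoring and the episode-length effect described in the captions above can be illustrated with a small sketch. It assumes only what the captions state: the score is the sum of per-step rewards and the curves in Figure 3 are medians across runs. The function names and all reward traces below are synthetic placeholders, not data from the paper.

```python
from statistics import median
from typing import List, Sequence


def humanoidbench_score(rewards: Sequence[float]) -> float:
    """Score of one trajectory: the sum of its per-step rewards."""
    return sum(rewards)


def median_reward_per_step(runs: Sequence[Sequence[float]]) -> List[float]:
    """Median reward across runs at every time step (runs of equal length)."""
    return [median(step) for step in zip(*runs)]


# Synthetic traces: a controller that keeps the maximum reward of 1 and
# one that starts well but collapses after 200 steps.
steady = [1.0] * 1000
collapsing = [1.0] * 200 + [0.0] * 800

short_horizon = 200  # hypothetical early cut-off of the episode
for name, trace in [("steady", steady), ("collapsing", collapsing)]:
    short = humanoidbench_score(trace[:short_horizon])
    full = humanoidbench_score(trace)
    print(f"{name}: short-episode score {short}, full-episode score {full}")

# Median over three synthetic runs of the collapsing controller.
runs = [collapsing, collapsing, steady]
medians = median_reward_per_step(runs)
print(medians[0], medians[-1])
```

With the early cut-off, both synthetic controllers reach the same score, and only the longer episode separates them, which is the effect the extended evaluation is meant to expose.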