Preference-Conditioned Multi-Objective RL for Integrated Command Tracking and Force Compliance in Humanoid Locomotion

Tingxuan Leng, Yushi Wang, Tinglong Zheng, Changsheng Luo, Mingguo Zhao

TL;DR

The paper addresses the conflict between velocity command tracking and external force compliance in humanoid locomotion. It introduces a preference-conditioned multi-objective RL framework that maps external forces to equivalent velocities via a velocity–resistance model (e.g. $v_{ext} = k \cdot F_{ext}$ and $F_{res} = -B v$), enabling a single policy to interpolate between rigid tracking and compliant guidance. An encoder–decoder learns force-relevant latent features from deployable observations, and the policy is optimized with PPO using a MORL reward vector $\mathbf{r}=[r_c,r_f,r_r]$ and a preference vector $\mathbf{w}$. Experiments in simulation and on Booster T1 hardware demonstrate deployable omnidirectional locomotion, online preference switching, and improved robustness over single-objective baselines.
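The two ingredients named above, the force-to-velocity mapping and the preference-weighted reward vector, can be illustrated with a minimal Python sketch. Several things here are assumptions rather than the paper's method: the gain value `k`, the use of linear scalarization $\mathbf{w}^\top \mathbf{r}$ to combine the reward vector, and the normalization of $\mathbf{w}$ to sum to one. The summary does not define the individual components of $\mathbf{r}$, so they are treated as three opaque scalars.

```python
import numpy as np

def equivalent_velocity(force, k):
    """Map an external force (N) to an equivalent velocity (m/s) via v_ext = k * F_ext.

    The gain k is a hypothetical value chosen for illustration only.
    """
    return k * np.asarray(force, dtype=float)

def scalarize(rewards, preference):
    """Combine a reward vector r with a preference vector w as w . r.

    Assumption: linear scalarization with w normalized to sum to 1.
    The summary only states that r and w exist, not how they are combined.
    """
    w = np.asarray(preference, dtype=float)
    r = np.asarray(rewards, dtype=float)
    if w.shape != r.shape:
        raise ValueError("preference and reward vectors must have the same shape")
    if np.any(w < 0) or w.sum() == 0:
        raise ValueError("preference weights must be non-negative and not all zero")
    return float((w / w.sum()) @ r)

# A 20 N push along x with a hypothetical gain k = 0.02 m/(s*N).
v_ext = equivalent_velocity([20.0, 0.0], k=0.02)

# Hypothetical reward values [r_c, r_f, r_r] and an unnormalized preference.
total = scalarize([1.0, 0.0, 0.5], [2.0, 1.0, 1.0])
```

Shifting weight from the first entry of `preference` to the second would, under these assumptions, move the scalar objective from favoring tracking toward favoring compliance without retraining a separate policy.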

Abstract

Humanoid locomotion requires not only accurate command tracking for navigation but also compliant responses to external forces during human interaction. Despite significant progress, existing RL approaches mainly emphasize robustness, yielding policies that resist external forces but lack compliance, which is particularly challenging for inherently unstable humanoids. In this work, we address this by formulating humanoid locomotion as a multi-objective optimization problem that balances command tracking and external force compliance. We introduce a preference-conditioned multi-objective RL (MORL) framework that integrates rigid command following and compliant behaviors within a single omnidirectional locomotion policy. External forces are modeled via a velocity-resistance factor for consistent reward design, and training leverages an encoder-decoder structure that infers task-relevant privileged features from deployable observations. We validate our approach in both simulation and real-world experiments on a humanoid robot. Experimental results indicate that our framework not only improves adaptability and convergence over standard pipelines, but also realizes deployable preference-conditioned humanoid locomotion.

Paper Structure

This paper contains 22 sections, 11 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Preference-conditioned locomotion: A single policy realizes behaviors from command tracking to human-guided compliance by adjusting the preference. Arrows indicate velocity command (blue), external force (red), and resulting humanoid velocity (yellow).
  • Figure 2: Policy training framework with privileged denoising: An asymmetric actor–critic architecture is extended with an encoder–decoder that reconstructs privileged observations, guiding the encoder to extract force- and torque-aware latent features. At deployment, only encoder and actor remain, enabling preference-conditioned control with onboard observations, latent embedding and preference vector.
  • Figure 3: Effect of reward weight $\mathbf{w}$ in the opposite setting: Each trial lasts 5 s, and the baseline policy result is plotted on the rightmost side for comparison. (a)(b) Pareto-front visualization of linear and angular velocity. (c) Average forward velocity $\bar{v}_x$ with $v_{c,x}=1.0 \ \mathrm{m/s}$ under three levels of backward force (10 N, 20 N, 30 N). (d) Average yaw angular velocity $\omega_{yaw}$ with $\omega_{c}=1.0 \ \mathrm{rad/s}$ under three levels of applied torque (3 Nm, 5 Nm, 7 Nm).
  • Figure 4: Effect of reward weight $\mathbf{w}$ in the orthogonal setting: Each trial lasts 5 s, and the baseline policy result is plotted for comparison. (a) Pareto-front visualization of linear velocity. (b) Velocities $v_x$ and $v_y$ with $v_{c,x}=1.0 \ \mathrm{m/s}$ under a leftward force (30 N).
  • Figure 5: Online switching of command weight: The trajectory lasts 12 s, with the command weight changed every 4 s. The linear case ($v_{c,x} = 1.0 \ \mathrm{m/s}$, $F_{\text{ext}}=20 \ \mathrm{N}$) and the angular case ($\omega_c = 1.0 \ \mathrm{rad/s}$, $\tau=5 \ \mathrm{N \cdot m}$) are run separately, each with a constant command and disturbance. The policy's behavior changes with the command weight.
  • ...and 4 more figures
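
The deployment path described in the Figure 2 caption (only the encoder and actor remain, and the actor receives onboard observations, the latent embedding, and the preference vector) might look roughly like the sketch below. Everything concrete here is an assumption for illustration: the dimensions, the single-layer `tanh` networks, the random stand-in weights, and the convention that the first preference entry favors tracking and the second favors compliance. It is not the authors' architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the summary does not state the real dimensions.
OBS_DIM, LATENT_DIM, PREF_DIM, ACT_DIM = 48, 16, 3, 12

# Random stand-ins for trained encoder and actor weights.
W_enc = 0.1 * rng.standard_normal((LATENT_DIM, OBS_DIM))
W_act = 0.1 * rng.standard_normal((ACT_DIM, OBS_DIM + LATENT_DIM + PREF_DIM))

def encode(obs):
    """Encoder: onboard observation -> latent embedding (single layer, assumed)."""
    return np.tanh(W_enc @ obs)

def act(obs, preference):
    """Actor: concatenate observation, latent embedding, and preference vector."""
    latent = encode(obs)
    actor_input = np.concatenate([obs, latent, preference])
    return np.tanh(W_act @ actor_input)

obs = rng.standard_normal(OBS_DIM)

# Same observation, two different preferences (hypothetical encoding).
a_track = act(obs, np.array([1.0, 0.0, 0.0]))
a_comply = act(obs, np.array([0.0, 1.0, 0.0]))
```

Because the preference vector is just another actor input, it can be changed between control steps, which is consistent with the online switching shown in Figure 5. The decoder and critic are omitted since the caption says they are used only during training.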