Preference-Conditioned Multi-Objective RL for Integrated Command Tracking and Force Compliance in Humanoid Locomotion
Tingxuan Leng, Yushi Wang, Tinglong Zheng, Changsheng Luo, Mingguo Zhao
TL;DR
The paper addresses the conflict between velocity command tracking and external force compliance in humanoid locomotion. It introduces a preference-conditioned multi-objective RL framework in which a velocity–resistance model maps external forces to an equivalent velocity, e.g. $v_{ext} = k \cdot F_{ext}$ and $F_{res} = -B v$, so that a single policy can interpolate between rigid tracking and compliant guidance. An encoder–decoder learns force-relevant latent features from deployable observations, and the policy is optimized with PPO using a MORL reward vector $\mathbf{r}=[r_c,r_f,r_r]$ and a preference vector $\mathbf{w}$. Experiments in simulation and on Booster T1 hardware demonstrate deployable omnidirectional locomotion, online preference switching, and improved robustness over single-objective baselines.
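As a rough illustration of the quantities named above, the following Python sketch maps a force to an equivalent velocity, computes a resistance force, and combines a reward vector with a preference vector. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the numeric values of `k` and `B`, and the use of linear scalarization $\mathbf{w}^\top \mathbf{r}$ are all hypothetical, and the summary does not say how the paper actually combines $\mathbf{r}$ and $\mathbf{w}$.

```python
from typing import List, Sequence


def equivalent_velocity(force: Sequence[float], k: float) -> List[float]:
    """Map an external force to an equivalent velocity, v_ext = k * F_ext."""
    return [k * f for f in force]


def resistance_force(velocity: Sequence[float], b: float) -> List[float]:
    """Resistance force opposing the velocity, F_res = -B * v."""
    return [-b * v for v in velocity]


def scalarize(rewards: Sequence[float], preference: Sequence[float]) -> float:
    """Linear scalarization w . r (an assumed choice, not confirmed by the source)."""
    if len(rewards) != len(preference):
        raise ValueError("rewards and preference must have the same length")
    return sum(w * r for w, r in zip(preference, rewards))


# Hypothetical values, chosen only for illustration.
K = 0.02   # assumed gain in (m/s) per N
B = 50.0   # assumed resistance coefficient in N per (m/s)

v_ext = equivalent_velocity([50.0, 0.0], K)   # planar push along x
f_res = resistance_force(v_ext, B)

r = [0.8, 0.2, 0.5]                # stand-in for [r_c, r_f, r_r]
w_tracking = [1.0, 0.0, 0.0]       # all weight on the first objective
w_mixed = [0.5, 0.5, 0.0]          # equal weight on the first two objectives

print(v_ext, f_res)
print(scalarize(r, w_tracking), scalarize(r, w_mixed))
```

Changing only the preference vector changes which reward components dominate the scalar training signal, which is the sense in which a single policy can be steered between tracking and compliance.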
Abstract
Humanoid locomotion requires not only accurate command tracking for navigation but also compliant responses to external forces during human interaction. Despite significant progress, existing RL approaches mainly emphasize robustness, yielding policies that resist external forces but lack compliance, which is particularly challenging for inherently unstable humanoids. In this work, we address this by formulating humanoid locomotion as a multi-objective optimization problem that balances command tracking and external force compliance. We introduce a preference-conditioned multi-objective RL (MORL) framework that integrates rigid command following and compliant behaviors within a single omnidirectional locomotion policy. External forces are modeled via a velocity-resistance factor for consistent reward design, and training leverages an encoder-decoder structure that infers task-relevant privileged features from deployable observations. We validate our approach in both simulation and real-world experiments on a humanoid robot. Experimental results indicate that our framework not only improves adaptability and convergence over standard pipelines, but also realizes deployable preference-conditioned humanoid locomotion.
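One common way to make a single policy preference-conditioned is to sample a preference vector per episode and append it to the policy input. The Python sketch below illustrates that idea only; the abstract does not state how preferences are sampled or fed to the network, so uniform sampling on the simplex, input concatenation, and the helper names `sample_preference` and `condition_observation` are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def sample_preference(n_objectives: int, rng: random.Random) -> List[float]:
    """Draw a non-negative preference vector that sums to 1."""
    # Normalized i.i.d. exponential draws are uniform on the simplex,
    # i.e. a Dirichlet(1, ..., 1) sample. This sampling scheme is assumed.
    draws = [rng.expovariate(1.0) for _ in range(n_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def condition_observation(obs: Sequence[float],
                          preference: Sequence[float]) -> List[float]:
    """Append the preference vector to the observation (assumed conditioning)."""
    return list(obs) + list(preference)


rng = random.Random(0)               # seeded for reproducibility
w = sample_preference(3, rng)        # one weight per reward component
obs = [0.1, -0.2, 0.05]              # placeholder observation values
policy_input = condition_observation(obs, w)

print(len(policy_input))  # 6
```

At deployment, the same input slot could carry a user-chosen preference instead of a sampled one, which is one way to read the online preference switching that the summary reports.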
