Gain Tuning Is Not What You Need: Reward Gain Adaptation for Constrained Locomotion Learning

Arthicha Srisuchinnawong, Poramate Manoonpong

TL;DR

ROGER (Reward-Oriented Gains via Embodied Regulation) is an online gain-adaptation mechanism for constrained locomotion learning. It maintains multiple reward channels and updates their weighting gains according to proximity to the constraint thresholds, achieving near-zero constraint violations while improving the primary reward. This is demonstrated on a heavy quadruped in simulation, in real-world training within about an hour, and on MuJoCo Hopper, where it yields substantial gains in distance and reductions in torque and orientation deviation. The method reduces the need for manual reward tuning, provides partial stability guarantees, and carries over from simulation to real-world learning and across several locomotion tasks, indicating practical potential for safe continual robot learning in the real world. The approach contributes a simple embodied rule for balancing safety and performance without extensive hyperparameter searches, paving the way for constraint-satisfying continual locomotion learning in physical robots.

Abstract

Existing robot locomotion learning techniques rely heavily on the offline selection of proper reward weighting gains and cannot guarantee constraint satisfaction (i.e., the absence of constraint violations) during training. This work addresses both issues by proposing Reward-Oriented Gains via Embodied Regulation (ROGER), which adapts reward-weighting gains online based on penalties received throughout the embodied interaction process. The ratio between the positive reward (primary reward) gain and the negative reward (penalty) gain is automatically reduced as learning approaches the constraint thresholds, to avoid violation. Conversely, the ratio is increased when learning is in safe states, to prioritize performance. With a 60-kg quadruped robot, ROGER achieved near-zero constraint violation throughout multiple learning trials. It also achieved up to 50% more primary reward than equivalent state-of-the-art techniques. In MuJoCo continuous locomotion benchmarks, including a single-leg hopper, ROGER exhibited comparable or up to 100% higher performance and 60% less torque usage and orientation deviation compared to policies trained with the default reward function. Finally, real-world locomotion learning of a physical quadruped robot was achieved from scratch within one hour without any falls. Therefore, this work contributes to constraint-satisfying real-world continual robot locomotion learning and simplifies reward weighting gain tuning, potentially facilitating the development of physical robots and those that learn in the real world.
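The gain-ratio behaviour described in the abstract can be illustrated with a minimal Python sketch. This is not the update rule from the paper: the linear schedule, the function names `reward_gains` and `combined_reward`, and the normalization of the two gains to sum to one are all assumptions made for illustration. Only the qualitative behaviour (the reward-to-penalty gain ratio shrinks near the threshold and grows in safe states) and the 0.2 rad example threshold are taken from the text.

```python
def reward_gains(constraint_value, threshold):
    """Return (reward_gain, penalty_gain) for one constraint channel.

    Hypothetical linear schedule (an assumption, not the paper's rule):
    the penalty gain grows from 0 to 1 as the measured constraint value
    (e.g. |roll| in rad) moves from 0 toward the threshold.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    # Proximity to the threshold, clipped to [0, 1].
    proximity = min(max(constraint_value / threshold, 0.0), 1.0)
    return 1.0 - proximity, proximity


def combined_reward(primary_reward, constraint_value, threshold):
    """Weight the primary reward against the penalty with adapted gains."""
    reward_gain, penalty_gain = reward_gains(constraint_value, threshold)
    return reward_gain * primary_reward - penalty_gain * constraint_value


if __name__ == "__main__":
    # Example threshold of 0.2 rad, as used for roll and pitch in the figures.
    for roll in (0.0, 0.1, 0.19):
        print(roll, reward_gains(roll, 0.2))
```

In this sketch a robot that is far from the threshold is trained almost entirely on the primary reward, while one close to the threshold is trained almost entirely on the penalty, which mirrors the safety-versus-performance trade-off described above.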

Paper Structure

This paper contains 24 sections, 18 equations, 19 figures, 5 tables.

Figures (19)

  • Figure 2: Illustration of how embodied interaction between the robot and the environment can be used to train a control policy. The traditional RL loop is shown in gray, and the additional ROGER loop is shown in blue.
  • Figure 3: This adaptation strategy is also proven to be partially stable in key conditions (i.e., near the constraint thresholds and at convergence), while the expected primary reward is guaranteed to increase, as detailed in Appendix \ref{app:proof}. As a result, if an optimal solution exists far from the constraint thresholds, such as body orientation in robot locomotion, ROGER will converge to that solution; otherwise, it chooses a safe alternative.
  • Figure 4: Locomotion learning framework for the Unitree B1 quadruped with ROGER. An adaptive neural control produces joint position targets that are used by low-level controllers. After execution, the robot receives rewards and penalties, which are then combined using ROGER and subsequently used to train the neural control.
  • Figure 5: (a) Final primary reward values obtained from the last training episode and (b-c) roll and pitch angles recorded throughout the locomotion learning of the simulated Unitree B1 quadruped robot. The robot was trained using six techniques: two fixed-weighting techniques in red (fixed-gain penalty and fixed-gain CBF), three adaptive weighting techniques in gray (PDO, CRPO, and OL-AUX), and ROGER in blue. All conditions are presented along with their kernel density estimation. In (b-c), red dashed lines indicate the constraint thresholds at $\pm 0.2$ rad, or approximately $\pm 10^\circ$; therefore, the data points exceeding these lines indicate violations. A video of this experiment is available at https://youtu.be/cZ5qOw0i_T4.
  • Figure 6: Evolution of the main reward term across 500 learning episodes from the locomotion learning of the simulated Unitree B1 quadruped robot trained with (dark red) only the primary reward term, (light red) fixed gain penalty, (blue) ROGER, and (gray) CRPO.
  • ...and 14 more figures