Learning Risk-Aware Quadrupedal Locomotion using Distributional Reinforcement Learning

Lukas Schneider, Jonas Frey, Takahiro Miki, Marco Hutter

TL;DR

This work addresses the lack of explicit risk handling in legged locomotion by introducing Distributional Proximal Policy Optimization (DPPO), which learns a full return distribution $Z_\theta(s)$ via QR-DQN and distorts it with risk metrics (CVaR and Wang) parameterized by $\beta$ to obtain risk-sensitive estimates $V_\beta(s)$. The policy is conditioned on $\beta$ and updated with a clipped PPO objective using risk-adjusted advantages, enabling online risk adaptation. Key contributions include explicit risk modeling in locomotion without reward shaping, demonstration of emergent risk-sensitive behavior in both simulation and hardware (ANYmal), and ablations showing Wang distortion provides robust performance across risk preferences. This framework supports dynamic risk-aware control for teleoperation and navigation in hazardous environments, with potential for integration into higher-level planning.
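
The distortion step described above can be illustrated with a small sketch. It assumes the critic outputs N equally weighted quantile values, and it borrows the distortion conventions common in the distributional RL literature (CVaR with level beta in (0, 1], Wang with negative beta as risk-averse); the paper's exact parameterization and sign convention may differ. All function names are hypothetical.

```python
from statistics import NormalDist

import numpy as np

_STD_NORMAL = NormalDist()  # standard normal, used by the Wang distortion


def cvar_weights(n, beta):
    """Weights that average the lowest `beta` fraction of n sorted quantiles.

    Assumption: beta in (0, 1]; beta = 1 recovers the plain mean.
    """
    edges = np.linspace(0.0, 1.0, n + 1)       # cumulative probabilities
    distorted = np.minimum(edges / beta, 1.0)  # distorted cumulative mass
    return np.diff(distorted)


def wang_weights(n, beta):
    """Weights from a Wang-style distortion of the cumulative probabilities.

    Assumption: beta < 0 shifts mass toward low returns (risk-averse),
    beta > 0 toward high returns (risk-seeking), beta = 0 gives the mean.
    """
    edges = np.linspace(0.0, 1.0, n + 1)
    distorted = edges.copy()                   # endpoints 0 and 1 stay fixed
    for i in range(1, n):
        z = _STD_NORMAL.inv_cdf(edges[i])
        distorted[i] = _STD_NORMAL.cdf(z - beta)
    return np.diff(distorted)


def risk_value(quantiles, weights):
    """Risk-sensitive value: weighted sum of the sorted quantile values."""
    return float(np.sort(np.asarray(quantiles, dtype=float)) @ weights)


if __name__ == "__main__":
    q = [3.0, 0.0, 2.0, 1.0]                   # toy quantile values
    print("mean        :", risk_value(q, cvar_weights(4, 1.0)))
    print("CVaR 0.5    :", risk_value(q, cvar_weights(4, 0.5)))
    print("Wang averse :", risk_value(q, wang_weights(4, -0.5)))
    print("Wang seeking:", risk_value(q, wang_weights(4, 0.5)))
```

Under these assumptions, a single scalar beta moves the estimate continuously between pessimistic and optimistic readings of the same learned distribution, which is what allows the risk preference to be changed without retraining the critic.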

Abstract

Deployment in hazardous environments requires robots to understand the risks associated with their actions and movements to prevent accidents. Despite their importance, these risks are not explicitly modeled by currently deployed locomotion controllers for legged robots. In this work, we propose a risk-sensitive locomotion training method employing distributional reinforcement learning to consider safety explicitly. Instead of relying on a value expectation, we estimate the complete value distribution to account for uncertainty in the robot's interaction with the environment. The value distribution is consumed by a risk metric to extract risk-sensitive value estimates. These are integrated into Proximal Policy Optimization (PPO) to derive our method, Distributional Proximal Policy Optimization (DPPO). The risk preference, ranging from risk-averse to risk-seeking, can be controlled by a single parameter, which allows the robot's behavior to be adjusted dynamically. Importantly, our approach removes the need for additional reward function tuning to achieve risk sensitivity. We show emergent risk-sensitive locomotion behavior in simulation and on the quadrupedal robot ANYmal. Videos of the experiments and code are available at https://sites.google.com/leggedrobotics.com/risk-aware-locomotion.
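
As a rough illustration of how a critic can estimate a value distribution rather than an expectation, the sketch below implements the quantile Huber loss known from QR-DQN for a single state. It is a minimal sketch, not the authors' training code: the function name, the `kappa` default, and the way target samples are formed are assumptions, since this page does not specify them.

```python
import numpy as np


def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Quantile Huber loss between N predicted quantiles and M target samples.

    Assumption: quantile i is trained at the midpoint level (i + 0.5) / N.
    """
    pred = np.asarray(pred_quantiles, dtype=float)    # shape (N,)
    target = np.asarray(target_samples, dtype=float)  # shape (M,)
    n = pred.shape[0]
    taus = (np.arange(n) + 0.5) / n                   # quantile midpoints

    # Pairwise errors u[i, j] = target_j - pred_i, shape (N, M).
    u = target[None, :] - pred[:, None]
    abs_u = np.abs(u)

    # Huber loss: quadratic near zero, linear beyond kappa.
    huber = np.where(abs_u <= kappa, 0.5 * u**2, kappa * (abs_u - 0.5 * kappa))

    # Asymmetric weight |tau - 1{u < 0}| pushes each output toward its quantile.
    weight = np.abs(taus[:, None] - (u < 0.0).astype(float))

    # Average over target samples, sum over predicted quantiles.
    return float((weight * huber / kappa).mean(axis=1).sum())
```

The asymmetric weighting is what separates the outputs into different quantiles: over-estimates and under-estimates are penalized differently depending on the quantile level each output is assigned.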

Paper Structure

This paper contains 9 sections, 7 equations, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: Our robot learns to adapt its locomotion behavior in risky situations. When commanded to walk up a large step, the risk-averse policy refuses while the risk-seeking policy complies. The risk parameter, controlling the value distribution distortion, can be adapted online during deployment. Bottom: Respective risk-metric distorted value distribution per robot.
  • Figure 2: Architecture overview. The critic learns to predict a value distribution, used in combination with a risk metric to update the policy. The policy is conditioned on the risk parameter. The risk parameter is part of the command, set by the operator.
  • Figure 3: Application of a risk metric. The risk sensitivity selects how the value distribution is distorted. The mean value of the distorted distribution is provided for Generalized Advantage Estimation.
  • Figure 4: Robot operated remotely along an obstacle course. Depending on the chosen path, different risk sensitivities are preferable. For the incline along path (a), a risk-seeking policy can be chosen for increased walking speed. Using a risk-averse policy when descending the stairs along route (b) ensures the robot's safety. A risk-averse policy won't climb the dangerous obstacle (c) and thus would have to walk around it, along route (d). Meanwhile, setting the policy to risk-seeking allows the robot to surmount the obstacle (c). To step down into the deep pit along route (e), one must set the sensitivity to risk-seeking. The risk-averse policy will refuse to step into the pit (f) as it may lead to a crash. Video: https://youtu.be/GGFXpF4qeVY.
  • Figure 5: Average return in the evaluation environment. Shaded regions indicate $95\%$ confidence intervals across seeds and evaluation spawns. Hyperparameters for DPPO were not tuned.
  • ...and 2 more figures
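
The caption of Figure 3 states that the mean of the distorted distribution is provided for Generalized Advantage Estimation. A minimal sketch of that step follows, assuming the `values` array already holds the risk-distorted estimates $V_\beta(s_t)$ for one rollout. The function name, argument layout, and default discount factors are hypothetical and not taken from the paper.

```python
import numpy as np


def gae_from_risk_values(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout of length T.

    Assumption: values[t] is the risk-distorted value estimate of state t,
    last_value is the estimate for the state after the final step, and
    dones[t] is 1 if the episode terminated at step t.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    dones = np.asarray(dones, dtype=float)

    advantages = np.zeros(len(rewards))
    next_value = float(last_value)
    next_advantage = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 1.0 - dones[t]
        # One-step TD error with the risk-distorted values as baseline.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors, cut at episode ends.
        next_advantage = delta + gamma * lam * not_done * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages
```

Because only the baseline values change, the resulting advantages can be used in a standard clipped PPO update; the risk preference enters through the value estimates rather than through the reward.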