Risk-Aware Reinforcement Learning for Mobile Manipulation

Michael Groom, James Wilson, Nick Hawes, Lars Kunze

TL;DR

This work is the first to learn risk-aware visuomotor policies for mobile manipulation conditioned on egocentric depth observations with runtime-adjustable risk sensitivity, and is the first to show risk-aware behaviours can be transferred through Imitation Learning to a visuomotor policy conditioned on egocentric depth observations.

Abstract

For robots to successfully transition from lab settings to everyday environments, they must begin to reason about the risks associated with their actions and make informed, risk-aware decisions. This is particularly true for robots performing mobile manipulation tasks, which involve both interacting with and navigating within dynamic, unstructured spaces. However, existing whole-body controllers for mobile manipulators typically lack explicit mechanisms for risk-sensitive decision-making under uncertainty. To our knowledge, we are the first to (i) learn risk-aware visuomotor policies for mobile manipulation conditioned on egocentric depth observations with runtime-adjustable risk sensitivity, and (ii) show risk-aware behaviours can be transferred through Imitation Learning (IL) to a visuomotor policy conditioned on egocentric depth observations. Our method achieves this by first training a privileged teacher policy using Distributional Reinforcement Learning (DRL), with a risk-neutral distributional critic. Distortion risk-metrics are then applied to the critic's predicted return distribution to calculate risk-adjusted advantage estimates used in policy updates to achieve a range of risk-aware behaviours. We then distil teacher policies with IL to obtain risk-aware student policies conditioned on egocentric depth observations. We perform extensive evaluations demonstrating that our trained visuomotor policies exhibit risk-aware behaviour (specifically achieving better worst-case performance) while performing reactive whole-body motions in unmapped environments, leveraging live depth observations for perception.
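
The step in which a distortion risk metric is applied to the critic's predicted return distribution can be illustrated with a small sketch. The abstract does not name the distortion function, so the code below assumes Wang's transform purely for illustration. Its sign convention is chosen to match the figure captions ($\beta=0$ risk-neutral, $\beta>0$ risk-averse, $\beta<0$ risk-seeking). The function names, the one-step advantage form, and the discount factor are hypothetical and are not taken from the paper.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for Phi and its inverse


def wang_distortion(tau: float, beta: float) -> float:
    """Distort a cumulative probability tau in [0, 1].

    Assumed convention (not confirmed by the source): beta > 0 moves
    probability mass toward low returns, i.e. risk-averse.
    """
    if tau <= 0.0:
        return 0.0
    if tau >= 1.0:
        return 1.0
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(tau) + beta)


def risk_adjusted_value(quantiles, beta):
    """Distorted expectation over N equally weighted quantile estimates."""
    q = sorted(quantiles)
    n = len(q)
    value = 0.0
    for i, q_i in enumerate(q):
        # Weight of the i-th quantile = distorted mass of the bin [i/n, (i+1)/n].
        weight = wang_distortion((i + 1) / n, beta) - wang_distortion(i / n, beta)
        value += weight * q_i
    return value


def risk_adjusted_advantage(reward, quantiles, next_quantiles, beta, gamma=0.99):
    """Hypothetical one-step advantage: r + gamma * V_beta(s') - V_beta(s)."""
    v_now = risk_adjusted_value(quantiles, beta)
    v_next = risk_adjusted_value(next_quantiles, beta)
    return reward + gamma * v_next - v_now


if __name__ == "__main__":
    # A toy return distribution with one bad outcome in the lower tail.
    critic_quantiles = [-10.0, 0.0, 1.0, 2.0, 3.0]
    for beta in (-1.0, 0.0, 1.0):
        print(beta, risk_adjusted_value(critic_quantiles, beta))
```

With $\beta=0$ the weights are uniform and the value reduces to the ordinary mean, which is consistent with keeping the critic itself risk-neutral and applying the risk attitude only when the advantage is computed. Because $\beta$ enters only through the distortion, the same critic output can in principle serve several risk attitudes.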

Paper Structure (9 sections, 4 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Predicted value distributions from a risk-aware critic during a successful pick attempt at key time steps (PDFs via critic-predicted quantiles) shown for different risk attitudes: risk-neutral ($\beta=0.0$), risk-seeking ($\beta=-1.0$), and risk-averse ($\beta=+1.0$). Risk sensitivity alters the perceived relative likelihood of outcomes, producing risk-aware behaviour. Note changing axis ranges.
  • Figure 2: An overview of our proposed framework. Phase 1: A risk-aware DRL teacher policy $\pi_{\theta}$ is trained. A critic predicts value distributions $Z_{\phi}(s)$ which are distorted by a risk metric to calculate a risk distorted expected value, which is used to update the teacher policy $\pi_{\theta}$, which is also conditioned on the selected risk-sensitivity. Phase 2: A student policy $\pi_{\psi}$ conditioned on high-dimensional depth observations is learnt through IL with the risk-aware teacher policy $\pi_{\theta}$. The risk-sensitivity parameter $\beta$ is assumed to be provided by an external operator or planner at runtime.
  • Figure 3: Training environments. Left: Navigation task, reaching a 3D target (axes) while avoiding static and dynamic obstacles. Right: Pick task, grasping and lifting a cube to a goal (red sphere).
  • Figure 4: Training curves for (a) navigation and (b) pick tasks. Teachers evaluate $16\times$ more steps per episode due to higher parallel environment counts ($4096$ vs. $256$). The drop at episode 600 in (b) marks the switch to student-driven environment stepping.
  • Figure 5: Task rates for policies evaluated on the navigation and pick tasks. (a) Left to right: task success rate; contact rate with the environment and the dynamic obstacle; cumulative return. (b) Left to right: task success rate; time out rate (not termination); time to reach goal success; time to task failure termination.
  • ...and 3 more figures