
Accelerating Residual Reinforcement Learning with Uncertainty Estimation

Lakshita Dodeja, Karl Schmeckpeper, Shivam Vats, Thomas Weng, Mingxi Jia, George Konidaris, Stefanie Tellex

Abstract

Residual Reinforcement Learning (RL) is a popular approach for adapting pretrained policies by learning a lightweight residual policy that provides corrective actions. While Residual RL is more sample-efficient than finetuning the entire base policy, existing methods struggle with sparse rewards and are designed for deterministic base policies. We propose two improvements to Residual RL that further enhance its sample efficiency and make it suitable for stochastic base policies. First, we leverage uncertainty estimates of the base policy to focus exploration on regions in which the base policy is not confident. Second, we propose a simple modification to off-policy residual learning that allows it to observe base actions and better handle stochastic base policies. We evaluate our method with both Gaussian-based and Diffusion-based stochastic base policies on tasks from Robosuite and D4RL, and compare against state-of-the-art finetuning methods, demo-augmented RL methods, and other residual RL methods. Our algorithm significantly outperforms existing baselines in a variety of simulation benchmark environments. We also deploy our learned policies in the real world to demonstrate their robustness with zero-shot sim-to-real transfer. Paper homepage: lakshitadodeja.github.io/uncertainty-aware-residual-rl/
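The first idea in the abstract, restricting residual exploration to regions where the base policy is uncertain, can be illustrated with a minimal sketch. This is not the authors' implementation. The hard threshold gate, the `threshold` and `residual_scale` values, the nearest-neighbour form of distance to data, and all function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Vector = List[float]


def distance_to_data(state: Vector, dataset_states: Sequence[Vector]) -> float:
    """Euclidean distance from `state` to its nearest neighbour in the
    base policy's training data (one assumed form of 'distance to data')."""
    return min(math.dist(state, s) for s in dataset_states)


def ensemble_variance(actions: Sequence[Vector]) -> float:
    """Variance of the actions proposed by an ensemble of base policies,
    averaged over action dimensions."""
    n = len(actions)
    dims = len(actions[0])
    total = 0.0
    for d in range(dims):
        mean_d = sum(a[d] for a in actions) / n
        total += sum((a[d] - mean_d) ** 2 for a in actions) / n
    return total / dims


def combined_action(
    base_action: Vector,
    residual_action: Vector,
    uncertainty: float,
    threshold: float = 0.05,      # hypothetical value
    residual_scale: float = 0.1,  # hypothetical value
) -> Vector:
    """Apply the scaled residual only where the base policy is uncertain.
    The hard gate is an assumption; a smooth weighting would also fit the
    description in the abstract."""
    gate = 1.0 if uncertainty > threshold else 0.0
    return [
        max(-1.0, min(1.0, b + gate * residual_scale * r))
        for b, r in zip(base_action, residual_action)
    ]


# Usage: the residual is ignored when the ensemble agrees and applied when it disagrees.
confident = ensemble_variance([[0.2, 0.1], [0.2, 0.1]])
unsure = ensemble_variance([[0.9, -0.4], [-0.7, 0.6]])
print(combined_action([0.5, -0.5], [1.0, 1.0], confident))
print(combined_action([0.5, -0.5], [1.0, 1.0], unsure))
```

With a gate of this kind, the executed action equals the base action wherever the uncertainty is low, so the residual policy only explores in states where the base policy is less reliable.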

Paper Structure

This paper contains 22 sections, 11 equations, 11 figures, 3 tables, 1 algorithm.

Figures (11)

  • Figure 1: We propose two improvements to accelerate Residual RL: (1) we use uncertainty estimation to constrain exploration around the base policy; (2) we modify the off-policy critic to learn the $Q$ function for the combined action. We test our method with two uncertainty metrics: distance to data and ensemble variance.
  • Figure 2: We test our proposed approach on the Lift, Can, and Square tasks from Robosuite (Mandlekar et al., 2021) and the Franka Kitchen task from D4RL (Fu et al., 2020).
  • Figure 3: Results on Robosuite environments with a GMM base policy. Our method outperforms all other baselines on all tasks. Error bars indicate 95% confidence intervals.
  • Figure 4: Results on Franka Kitchen and Robosuite environments with a Diffusion base policy. Our method outperforms all baselines on the Kitchen Complete and Can tasks and performs comparably on the Square task. Error bars indicate 95% confidence intervals.
  • Figure 5: Learning with either the residual or the complete action works well with deterministic base policies, but learning with the complete action is required for stochastic base policies.
  • ...and 6 more figures
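
The point made in the Figure 1 and Figure 5 captions, that the critic should see the combined action when the base policy is stochastic, can be sketched as follows. This is an illustrative toy, not the paper's code. The `Transition` record, the `make_transition` helper, and the Gaussian base policy with fixed standard deviation are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Transition:
    state: List[float]
    critic_action: List[float]  # the action the critic is trained on
    reward: float
    next_state: List[float]


def sample_base_action(mean: List[float], std: float, rng: random.Random) -> List[float]:
    """Toy stochastic base policy: independent Gaussian noise around a mean action."""
    return [rng.gauss(m, std) for m in mean]


def make_transition(state, base_action, residual_action, reward, next_state,
                    use_combined: bool = True) -> Transition:
    """Build a replay-buffer entry. With use_combined=True the critic sees the
    executed action (base + residual); otherwise it sees only the residual."""
    if use_combined:
        critic_action = [b + r for b, r in zip(base_action, residual_action)]
    else:
        critic_action = list(residual_action)
    return Transition(list(state), critic_action, reward, list(next_state))


# Usage: the same state and the same residual, but two different base samples.
rng = random.Random(0)
state, residual = [0.0, 0.0], [0.05, -0.05]
base_1 = sample_base_action([0.3, 0.3], 0.2, rng)
base_2 = sample_base_action([0.3, 0.3], 0.2, rng)

res_only_1 = make_transition(state, base_1, residual, 0.0, state, use_combined=False)
res_only_2 = make_transition(state, base_2, residual, 1.0, state, use_combined=False)
full_1 = make_transition(state, base_1, residual, 0.0, state)
full_2 = make_transition(state, base_2, residual, 1.0, state)

print(res_only_1.critic_action == res_only_2.critic_action)  # identical critic inputs
print(full_1.critic_action == full_2.critic_action)          # distinguishable critic inputs
```

In the residual-only records the critic receives identical inputs for two transitions that executed different actions, so any difference in outcome looks like noise. Storing the combined action keeps the two cases distinguishable, which is consistent with the behaviour the captions describe for stochastic base policies.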