Learning to Bridge the Gap: Efficient Novelty Recovery with Planning and Reinforcement Learning

Alicia Li, Nishanth Kumar, Tomás Lozano-Pérez, Leslie Kaelbling

TL;DR

This work proposes learning a "bridge policy" via Reinforcement Learning (RL) to adapt to novelties, and demonstrates that the approach learns policies that adapt to novelty more efficiently than several baselines, including a pure RL baseline.

Abstract

The real world is unpredictable. Therefore, to solve long-horizon decision-making problems with autonomous robots, we must construct agents that are capable of adapting to changes in the environment during deployment. Model-based planning approaches can enable robots to solve complex, long-horizon tasks in a variety of environments. However, such approaches tend to be brittle when deployed into an environment featuring a novel situation that their underlying model does not account for. In this work, we propose to learn a "bridge policy" via Reinforcement Learning (RL) to adapt to such novelties. We introduce a simple formulation for such learning, where the RL problem is constructed with a special "CallPlanner" action that terminates the bridge policy and hands control of the agent back to the planner. This allows the RL policy to learn the set of states in which querying the planner and following the returned plan will achieve the goal. We show that this formulation enables the agent to rapidly learn by leveraging the planner's knowledge to avoid challenging long-horizon exploration caused by sparse reward. In experiments across three different simulated domains of varying complexity, we demonstrate that our approach is able to learn policies that adapt to novelty more efficiently than several baselines, including a pure RL baseline. We also demonstrate that the learned bridge policy is generalizable in that it can be combined with the planner to enable the agent to solve more complex tasks with multiple instances of the encountered novelty.
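
The control flow described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the `Corridor` toy world, the `planner`, `bridge_policy`, and `run` functions, and the action names `MoveRight` and `OpenDoor` are all hypothetical, and the bridge policy is hand-coded where the paper learns it with RL. The sketch only shows how a special `CallPlanner` action ends the bridge policy and returns control to a planner whose model does not know about doors.

```python
from dataclasses import dataclass

CALL_PLANNER = "CallPlanner"  # special action: end the bridge policy, replan


@dataclass
class Corridor:
    """Hypothetical 1-D world: walk right to `goal`; closed doors block cells."""
    goal: int = 5
    doors: tuple = (2,)  # cells whose doors start closed
    pos: int = 0

    def __post_init__(self):
        self.closed = set(self.doors)

    def step(self, action):
        if action == "MoveRight" and (self.pos + 1) not in self.closed:
            self.pos += 1
        elif action == "OpenDoor":
            self.closed.discard(self.pos + 1)
        return self.pos


def planner(env):
    """Planner whose model has no notion of doors: walk straight to the goal."""
    return ["MoveRight"] * (env.goal - env.pos)


def bridge_policy(env):
    """Hand-coded stand-in for a learned policy (the paper trains this with RL)."""
    return "OpenDoor" if (env.pos + 1) in env.closed else CALL_PLANNER


def run(env, max_steps=50):
    plan, mode, trace = planner(env), "plan", []
    for _ in range(max_steps):
        if env.pos == env.goal:
            return True, trace
        if mode == "plan":
            if not plan:
                plan = planner(env)
            before = env.pos
            env.step(plan.pop(0))
            if env.pos == before:  # the planned step had no effect: stuck
                mode = "bridge"
                trace.append("stuck")
        else:
            action = bridge_policy(env)
            trace.append(action)
            if action == CALL_PLANNER:
                plan, mode = planner(env), "plan"  # hand control back
            else:
                env.step(action)
    return env.pos == env.goal, trace
```

Calling `run(Corridor(doors=(2, 4)))` exercises the same loop with two doors: the bridge policy is invoked once per door and calls the planner each time, which mirrors the generalization to multiple instances of the novelty that the abstract mentions.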

Paper Structure (13 sections, 2 equations, 4 figures, 2 tables, 1 algorithm)


Figures (4)

  • Figure 1: Our approach in Light Switch Door. (a): During deployment, the robot encounters an unknown object (a door) when trying to turn on the light, which renders it unable to follow its plan. (b): The robot switches from planning to RL to learn how to overcome the novelty. (c): During evaluation, the robot switches from following its plan to following the learned "bridge policy" each time it gets stuck at a door, and then switches back to the planner once it opens the door. It is thus able to generalize to opening an arbitrary number of doors.
  • Figure 2: Visualizations of our different experimental domains.
  • Figure 3: Experiments in our different domains, with results from both evaluation time and training time. We report the smoothed reward (the reward averaged over the last 25 online learning cycles), averaged across all seeds, and plot its variance.
  • Figure : Approach
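
The smoothed reward mentioned in the Figure 3 caption can be read as a trailing average. The sketch below is one plausible reading, not the authors' plotting code: the function names `smooth` and `aggregate` are hypothetical, the handling of the first cycles (averaging over however many are available) is an assumption, and so is the use of population variance across seeds.

```python
from statistics import mean, pvariance


def smooth(rewards, window=25):
    """Trailing mean over the last `window` cycles (fewer at the start; assumed)."""
    return [
        mean(rewards[max(0, i - window + 1): i + 1])
        for i in range(len(rewards))
    ]


def aggregate(per_seed_rewards, window=25):
    """Per-cycle mean and population variance of the smoothed reward across seeds."""
    smoothed = [smooth(r, window) for r in per_seed_rewards]
    per_cycle = list(zip(*smoothed))  # one tuple of seed values per cycle
    means = [mean(c) for c in per_cycle]
    variances = [pvariance(c) for c in per_cycle]
    return means, variances
```

Here `per_seed_rewards` is assumed to be a list with one equal-length reward sequence per seed, indexed by online learning cycle.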