Explore-Go: Leveraging Exploration for Generalisation in Deep Reinforcement Learning

Max Weltevrede, Felix Kaubek, Matthijs T. J. Spaan, Wendelin Böhmer

TL;DR

Intuition is provided for why exploration can also benefit generalisation to states that cannot be explicitly encountered during training, and a novel method, Explore-Go, is proposed that exploits this intuition by increasing the number of states on which the agent trains.

Abstract

One of the remaining challenges in reinforcement learning is to develop agents that can generalise to novel scenarios they might encounter once deployed. This challenge is often framed in a multi-task setting where agents train on a fixed set of tasks and have to generalise to new tasks. Recent work has shown that in this setting increased exploration during training can be leveraged to increase the generalisation performance of the agent. This makes sense when the states encountered during testing can actually be explored during training. In this paper, we provide intuition for why exploration can also benefit generalisation to states that cannot be explicitly encountered during training. Additionally, we propose a novel method, Explore-Go, that exploits this intuition by increasing the number of states on which the agent trains. Explore-Go effectively broadens the starting state distribution of the agent and, as a result, can be used in conjunction with most existing on-policy or off-policy reinforcement learning algorithms. We show empirically that our method can increase generalisation performance in an illustrative environment and on the Procgen benchmark.
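
To make the idea of broadening the starting state distribution concrete, here is a minimal sketch under stated assumptions; it is not the authors' implementation. The abstract does not say how the extra start states are produced, so the sketch assumes a random-length phase of uniformly random actions after each reset. The toy environment `LineWorld`, the wrapper `ExploreThenStart`, and the parameter `max_explore_steps` are all hypothetical names introduced for illustration, and episode termination is ignored for brevity.

```python
import random


class LineWorld:
    """Hypothetical toy task: an agent on a line of cells with one fixed start."""

    def __init__(self, size=11, start=5):
        self.size = size
        self.start = start
        self.pos = start

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to the line
        self.pos = min(self.size - 1, max(0, self.pos + action))
        return self.pos


class ExploreThenStart:
    """Wrapper whose reset() first runs a random-length exploration phase.

    The state reached at the end of that phase is returned as the start
    state, so the wrapped agent sees more start states than the base task
    provides. Assumption: uniformly random actions stand in for whatever
    exploration policy is actually used.
    """

    def __init__(self, env, max_explore_steps, rng):
        self.env = env
        self.max_explore_steps = max_explore_steps
        self.rng = rng

    def reset(self):
        state = self.env.reset()
        num_steps = self.rng.randint(0, self.max_explore_steps)
        for _ in range(num_steps):
            state = self.env.step(self.rng.choice((-1, 1)))
        return state

    def step(self, action):
        # After reset, control passes to the learning agent unchanged
        return self.env.step(action)


def start_states(env, episodes):
    """Collect the set of distinct start states seen over several resets."""
    return {env.reset() for _ in range(episodes)}


rng = random.Random(0)
plain = start_states(LineWorld(), 200)
wrapped = start_states(ExploreThenStart(LineWorld(), 4, rng), 200)
```

Because the wrapper only changes what `reset()` returns, the learning algorithm on top is left untouched, which is consistent with the claim that the method can be combined with most on-policy or off-policy algorithms.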

Paper Structure (18 sections, 4 figures, 3 tables, 1 algorithm)

Figures (4)

  • Figure 1: Example of a reachable and unreachable task. The agent (circle) needs to move to the goal location (light green square). The reachable task on the right has a different start state, which can be reached from the training task. The unreachable task differs in its background and cannot be reached.
  • Figure 2: (a) Illustrative CMDP with four training tasks, each differing in background colour and agent (circle) starting position. All tasks share the same goal location (green square in the middle). (b) Performance of a baseline PPO agent and our Explore-Go agent on the CMDP. The agent trains on the tasks in (a) and is tested in tasks with a completely new background colour. Shown are mean and 95% confidence interval over 100 seeds. Below are (c) the states along the optimal trajectories, (d) the reachable state space, categorised by the task they're from (rows) and the optimal action (columns).
  • Figure 3: Performance of Explore-Go and PPO on the Procgen Benchmark. Shown are the mean and 95% confidence interval over 5 seeds.
  • Figure : PPO + Explore-Go

Theorems & Definitions (1)

  • Definition 1