Exploration Implies Data Augmentation: Reachability and Generalisation in Contextual MDPs

Max Weltevrede; Caroline Horsch; Matthijs T. J. Spaan; Wendelin Böhmer

TL;DR

The paper tackles zero-shot policy transfer in contextual MDPs by examining how exploration affects generalisation. It introduces reachability to distinguish contexts that can be reached from the training contexts from those that cannot, and identifies a trade-off between exploration-induced coverage and value-target accuracy. The proposed Explore-Go method inserts a pure exploration phase at the start of each episode so that the agent trains on more reachable contexts while preserving accurate targets. It can be combined with on- and off-policy algorithms and also works in partially observable settings. Empirically, Explore-Go improves generalisation to unreachable contexts across several environments and base algorithms, outperforming purely exploratory methods and offering practitioners a simple, broadly applicable modification.

Abstract

In the zero-shot policy transfer (ZSPT) setting for contextual Markov decision processes (MDPs), agents train on a fixed set of contexts and must generalise to new ones. Recent work has argued and demonstrated that increased exploration can improve this generalisation by training on more states in the training contexts. In this paper, we demonstrate that training on more states can indeed improve generalisation, but can come at the cost of reducing the accuracy of the learned value function, which should not benefit generalisation. We hypothesise and demonstrate that using exploration to increase the agent's coverage while also increasing the accuracy improves generalisation even more. Inspired by this, we propose a method, Explore-Go, that implements an exploration phase at the beginning of each episode, can be combined with existing on- and off-policy RL algorithms, and significantly improves generalisation even in partially observable MDPs. We demonstrate the effectiveness of Explore-Go when combined with several popular algorithms and show an increase in generalisation performance across several environments. With this, we hope to provide practitioners with a simple modification that can improve the generalisation of their agents.
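The exploration phase described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `ChainEnv` toy environment, the function name `explore_go_episode`, the uniform sampling of the exploration length, the choice to discard exploration transitions, and the reset on early termination are all assumptions made for this sketch.

```python
import random


class ChainEnv:
    """Hypothetical toy environment: positions 0..n-1 with the goal at the right end."""

    def __init__(self, n=10, start=0):
        self.n, self.start, self.pos = n, start, start

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clipped to the chain
        self.pos = max(0, min(self.n - 1, self.pos + action))
        done = self.pos == self.n - 1
        return self.pos, (1.0 if done else 0.0), done


def explore_go_episode(env, policy, max_explore_steps, rng, max_agent_steps=100):
    """Run one episode with a pure exploration phase before the agent takes over."""
    state = env.reset()

    # Phase 1 (assumed form): k uniformly random actions, with k sampled per episode.
    k = rng.randint(0, max_explore_steps)
    for _ in range(k):
        state, _, done = env.step(rng.choice([-1, 1]))
        if done:
            # Assumption: restart if exploration happens to end the episode.
            state = env.reset()

    # Phase 2: the state reached by exploration acts as the effective start state.
    # Only transitions from this phase are returned for training (an assumption).
    agent_start = state
    transitions = []
    for _ in range(max_agent_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        transitions.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            break
    return agent_start, transitions


rng = random.Random(0)
env = ChainEnv()
go_right = lambda s: 1  # stand-in for the learned policy
starts = {explore_go_episode(env, go_right, 5, rng)[0] for _ in range(200)}
```

Although the environment always resets to position 0, `starts` contains several different positions, so the agent's own rollouts begin from a wider set of reachable states than the single training start state.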

Paper Structure (30 sections, 1 theorem, 1 equation, 12 figures, 5 tables, 2 algorithms)

Introduction
Background
Contextual Markov decision process
How exploration can improve generalisation
Reachability in the ZSPT setting
Generalisation to unreachable contexts
The issue with exploration-induced data augmentation
Explore-Go: training on more reachable contexts with accurate targets
Experiments
Explore-Go with off- and on-policy algorithms
The effect of different exploration approaches on generalisation
Explore-Go on partially observable environments
Related work
Conclusion
Related work
...and 15 more sections

Key Result

Corollary 1

An optimal policy $\pi$ that achieves maximal return from any state in the reachable state space $S_r(\mathcal{M}|_{S_0^{train}})$ will have optimal performance in the reachable generalisation setting.
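To make the reachable state space concrete, here is a small sketch that enumerates it by breadth-first search. It assumes that "reachable" means a state can be visited with non-zero probability from some training start state under some sequence of actions; the summary above does not spell out the definition, so treat this as an assumption. The function `reachable_states`, the `successors` callback, and the toy CMDP are hypothetical names introduced for illustration.

```python
from collections import deque


def reachable_states(start_states, actions, successors):
    """Breadth-first enumeration of all states reachable from the start states.

    successors(state, action) must return an iterable of the next states that
    occur with non-zero probability.
    """
    seen = set(start_states)
    frontier = deque(seen)
    while frontier:
        state = frontier.popleft()
        for action in actions:
            for nxt in successors(state, action):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen


# Toy CMDP: a state is (context, position); the context never changes within
# an episode and the position moves left or right on a chain of length 4.
def successors(state, action):
    context, pos = state
    return [(context, max(0, min(3, pos + action)))]


train_starts = [(0, 0), (1, 2)]  # start states of two training contexts
s_r = reachable_states(train_starts, actions=[-1, 1], successors=successors)
```

In this toy example `s_r` holds all eight states of contexts 0 and 1. The state `(0, 3)` is not a training start state but is reachable, whereas any state of context 2 is unreachable because no transition changes the context.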

Figures (12)

Figure 1: Left: One training context in the Four Rooms Minigrid environment with examples of random transitions (pink arrows). Middle: train and test performance for DQN (red), DQN with a small fraction of random transitions added to the replay buffer (green), and DQN+Explore-Go (our method, blue). Right: the error between the estimated Q-value and the optimal value, averaged over all states in the training contexts. Shown are mean and 95% confidence intervals over 20 seeds.
Figure 2: Illustrative CMDP with four training contexts (top), each with a different background colour and starting position (circle). All contexts must reach the green square in the middle. Testing is done in an unreachable context with completely different background colour (white). (a) The optimal trajectories, here sorted by their context (rows) and their optimal action (columns), show a clear spurious correlation with the background colour. (b) This correlation vanishes when training on all reachable states. (c) Performance of a baseline PPO agent and our Explore-Go agent on the CMDP. The agent trains on the four contexts and is tested in contexts with a completely new background colour. Shown are the mean and 95% confidence interval over 100 seeds.
Figure 3: Training and unreachable test performance of Explore-Go in the Four Rooms environment. Mean and 95% confidence intervals for 20 seeds.
Figure 4: Comparing DQN, DQN+Explore-Go, DQN+TEE and DQN with increased exploration ($\beta = 0.5$) in Four Rooms: (a) train and test performance, (b) fraction of reachable state-action space explored, (c) fraction of reachable state space currently in the buffer and (d) true value error averaged over the entire reachable state space. Shown are the mean and 95% confidence intervals over 20 seeds.
Figure 5: The partially observable ViZDoom "My Way Home" CMDP (top). Mean and 95% confidence intervals for 20 seeds of APPO and APPO+Explore-Go.
...and 7 more figures

Theorems & Definitions (2)

Definition 1: Reachable/Unreachable generalisation
Corollary 1
