Zero-Shot Reinforcement Learning from Low Quality Data

Scott Jeen; Tom Bewley; Jonathan M. Cullen

TL;DR

This work explores how the performance of zero-shot RL methods degrades when trained on small homogeneous datasets, and proposes fixes inspired by conservatism, a well-established feature of performant single-task offline RL algorithms.

Abstract

Zero-shot reinforcement learning (RL) promises to provide agents that can perform any task in an environment after an offline, reward-free pre-training phase. Methods leveraging successor measures and successor features have shown strong performance in this setting, but require access to large heterogeneous datasets for pre-training, which cannot be expected for most real problems. Here, we explore how the performance of zero-shot RL methods degrades when trained on small homogeneous datasets, and propose fixes inspired by conservatism, a well-established feature of performant single-task offline RL algorithms. We evaluate our proposals across various datasets, domains and tasks, and show that conservative zero-shot RL algorithms outperform their non-conservative counterparts on low quality datasets, and perform no worse on high quality datasets. Somewhat surprisingly, our proposals also outperform baselines that get to see the task during training. Our code is available via https://enjeeneer.io/projects/zero-shot-rl/.

Paper Structure (50 sections, 24 equations, 17 figures, 9 tables, 1 algorithm)

Introduction
Preliminaries
Zero-Shot RL from Low Quality Data
Failure Mode of Existing Methods
Mitigating the Distribution Shift
A Didactic Example
Experiments
Setup
Baselines
Results
Discussion and Limitations
Related Work
Conclusion
Experimental Details
ExORL Domains
...and 35 more sections

Figures (17)

Figure 1: Conservative zero-shot RL. (Left) Zero-shot RL methods must train on a dataset collected by a behaviour policy optimising against task $z_{\mathrm{collect}}$, yet generalise to new tasks $z_{\mathrm{eval}}$. Both tasks have associated optimal value functions $Q_{z_{\mathrm{collect}}}^*$ and $Q_{z_{\mathrm{eval}}}^*$ for a given marginal state. (Middle) Existing methods, in this case forward-backward representations (FB), overestimate the value of actions not in the dataset for all tasks. (Right) Value-conservative forward-backward representations (VC-FB) suppress the value of actions not in the dataset for all tasks. Black dots ($\bullet$) represent state-action samples present in the dataset.
Figure 2: FB value overestimation with respect to dataset size $n$ and quality. Log $Q$ values and IQM of rollout performance on all Maze tasks for datasets Rnd and Random. $Q$ values predicted during training increase as both the size and "quality" of the dataset decrease. This contradicts the low return of all resultant policies (note: a return of 1000 is the maximum achievable for this task). Informally, we say the Rnd dataset is "high" quality, and the Random dataset is "low" quality; see the datasets appendix for more details.
Figure 3: Ignoring out-of-distribution actions. The agents are tasked with learning separate policies for reaching the blue and purple goals (each marked $\circledast$). (a) Rnd dataset with all "left" actions removed; quivers represent the mean action direction in each state bin. (b) Best FB rollout after 1 million learning steps. (c) Best VC-FB performance after 1 million learning steps. FB overestimates the value of out-of-distribution (OOD) actions and cannot complete either task; VC-FB synthesises the requisite information from the dataset and completes both tasks.
Figure 4: Aggregate zero-shot performance on ExORL. (Left) IQM of task scores across datasets and domains, normalised against the performance of CQL, our baseline. (Right) Performance profiles showing the distribution of scores across all tasks and domains. Both conservative FB variants stochastically dominate vanilla FB; see Agarwal et al. (2021) for performance profile exposition. The black dashed line represents the IQM of CQL performance across all datasets, domains, tasks and seeds.
Figure 5: Performance by dataset/domain on ExORL. IQM scores across tasks/seeds with 95% confidence intervals.
...and 12 more figures
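
Several captions above report the IQM (interquartile mean) of task scores, in one case normalised against a baseline. As a minimal sketch of that statistic, and not the authors' evaluation code, the snippet below computes the mean of the middle 50% of a set of scores. The function names `interquartile_mean` and `normalised_iqm`, the example scores, and the rule of dropping `floor(0.25 * n)` scores from each end are all assumptions made for illustration.

```python
import numpy as np


def interquartile_mean(scores):
    """Mean of the middle 50% of `scores` (hypothetical helper).

    Assumption: floor(0.25 * n) of the lowest and of the highest scores
    are discarded before averaging.
    """
    x = np.sort(np.asarray(scores, dtype=float).ravel())
    n = x.size
    if n == 0:
        raise ValueError("scores must be non-empty")
    cut = int(np.floor(0.25 * n))  # number of scores dropped per tail
    return float(x[cut:n - cut].mean())


def normalised_iqm(scores, baseline_scores):
    """IQM of `scores` divided by the IQM of a baseline's scores."""
    return interquartile_mean(scores) / interquartile_mean(baseline_scores)


# Illustrative, made-up returns on a 0-1000 scale.
method_scores = [0, 100, 200, 300, 400, 500, 600, 1000]
baseline_scores = [50, 100, 150, 200, 250, 300, 350, 400]

print(interquartile_mean(method_scores))
print(normalised_iqm(method_scores, baseline_scores))
```

Because the top and bottom quarters are discarded, a single outlying run (such as the 1000 above) does not move the aggregate, which is the usual motivation for preferring IQM over a plain mean when only a few seeds are available.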
