On Zero-Shot Reinforcement Learning

Scott Jeen

TL;DR

The thesis tackles zero-shot reinforcement learning in real-world settings where simulators are imperfect and task data may be scarce. It advances three core ideas: (i) conservative zero-shot RL to mitigate out-of-distribution value overestimation when data quality is low; (ii) memory-augmented zero-shot RL to handle partial observability and misidentified tasks; and (iii) no-prior-data building control via PEARL, enabling emission-efficient control with minimal commissioning data. Across these threads, the work demonstrates substantial improvements on benchmarks (ExORL, D4RL) and in building-control scenarios, showing that zero-shot policies can adapt to unseen tasks, dynamics, or rewards while respecting practical constraints. The contributions collectively push zero-shot RL toward deployable real-world impact, offering principled strategies for data-scarce, partially observed, or data-less domains. The results highlight the importance of conservatism, memory, and data-efficient planning for bridging the sim-to-real gap and point to a practical path for RL-driven systems in energy and beyond.

Abstract

Modern reinforcement learning (RL) systems capture deep truths about general, human problem-solving. In domains where new data can be simulated cheaply, these systems uncover sequential decision-making policies that far exceed the ability of any human. Society faces many problems whose solutions require this skill, but they are often in domains where new data cannot be cheaply simulated. In such scenarios, we can learn simulators from existing data, but these will only ever be approximately correct, and can be pathologically incorrect when queried outside of their training distribution. As a result, a misalignment between the environments in which we train our agents and the real world in which we wish to deploy them is inevitable. Dealing with this misalignment is the primary concern of zero-shot reinforcement learning, a problem setting where the agent must generalise to a new task or domain with zero practice shots. Whilst impressive progress has been made on methods that perform zero-shot RL in idealised settings, new work is needed if these results are to be replicated in real-world settings. In this thesis, we argue that doing so requires us to navigate (at least) three constraints. First, the data quality constraint: real-world datasets are small and homogeneous. Second, the observability constraint: states, dynamics and rewards in the real world are often only partially observed. And third, the data availability constraint: a priori access to data cannot always be assumed. This work proposes a suite of methods that perform zero-shot RL subject to these constraints. In a series of empirical studies we expose the failings of existing methods, and justify our techniques for remedying them. We believe these designs take us a step closer to RL methods that can be deployed to solve real-world problems.

Paper Structure

This paper contains 109 sections, 76 equations, 34 figures, 13 tables, 2 algorithms.

Figures (34)

  • Figure 1: Three paradigms of reinforcement learning. (Top) Model-free RL methods distill future rewards $r_t, r_{t+1}, r_{t+2}, \ldots$ into a value function, policy or both (§\ref{background: model-free RL}). (Middle) One-step model-based RL methods predict the state transition $s_t \leadsto s_{t+1}$ with a model (§\ref{background: one-step MBRL}). (Bottom) Multi-step model-based RL methods distill future state transitions $s_t, s_{t+1}, s_{t+2}, \ldots$ into a model (§\ref{background: multi-step MBRL}).
  • Figure 2: Trajectory stitching in offline RL. (Left) A graph MDP $\mathcal{M}$ where the task is to find the shortest path to the goal state. (Middle) A dataset of offline trajectories $\mathcal{D}_{\text{offline}}$ that may not contain the optimal trajectory for the task. (Right) The policy $\pi$ learns to combine sub-trajectories to find the shortest path to the goal.
  • Figure 3: Behaviour cloning $\leftrightarrow$ reinforcement learning continuum for offline RL methods. At the left end lie methods that attempt to mimic $\pi_{\beta}$, the policy that produced the data. The further right one moves, the less the methods attempt to mimic $\pi_{\beta}$ and the closer they are to full RL methods.
  • Figure 4: Conservative zero-shot RL. (Left) Zero-shot RL methods must train on a dataset collected by a behaviour policy optimising against task $z_{\mathrm{collect}}$, yet generalise to new tasks $z_{\mathrm{eval}}$. Both tasks have associated optimal value functions $Q_{z_{\mathrm{collect}}}^*$ and $Q_{z_{\mathrm{eval}}}^*$ for a given marginal state. (Middle) Existing methods, in this case forward-backward representations (FB), overestimate the value of actions not in the dataset for all tasks. (Right) Value-conservative forward-backward representations (VC-FB) suppress the value of actions not in the dataset for all tasks. Black dots represent state-action samples present in the dataset.
  • Figure 5: FB value overestimation with respect to dataset size $n$ and quality. Log $Q$ values and IQM of rollout performance on all Point-mass Maze tasks for datasets Rnd and Random. $Q$ values predicted during training increase as both the size and "quality" of the dataset decrease. This contradicts the low return of all resultant policies (note: a return of 1000 is the maximum achievable for this task). Informally, we say the Rnd dataset is "high" quality, and the Random dataset is "low" quality; see Appendix \ref{appendix: exorl datasets} for more details.
  • ...and 29 more figures
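
The trajectory stitching described in the Figure 2 caption can be illustrated with a small tabular sketch. This is not the thesis's method: it assumes a hypothetical deterministic graph with unit step cost, a made-up two-trajectory dataset, and plain Bellman backups restricted to transitions seen in the data. The function name `stitch_shortest_path` and the state labels are illustrative only.

```python
from collections import defaultdict

# Hypothetical offline dataset on a small graph MDP. Each trajectory reaches
# the goal "G" from "A" in 4 steps; neither one is the shortest path.
trajectories = [
    ["A", "B", "C", "D", "G"],
    ["A", "C", "E", "F", "G"],
]


def stitch_shortest_path(trajectories, start, goal):
    """Find the shortest start-to-goal path using only dataset transitions."""
    # Keep only the transitions that actually appear in the offline data.
    successors = defaultdict(set)
    for traj in trajectories:
        for s, s_next in zip(traj, traj[1:]):
            successors[s].add(s_next)

    states = {s for traj in trajectories for s in traj}
    inf = float("inf")

    # cost_to_go[s]: fewest steps from s to the goal via dataset transitions.
    cost_to_go = {s: inf for s in states}
    cost_to_go[goal] = 0

    # Repeated Bellman backups; len(states) sweeps are enough to converge
    # on a deterministic graph with unit step cost.
    for _ in range(len(states)):
        for s in states:
            if s == goal:
                continue
            for s_next in successors[s]:
                cost_to_go[s] = min(cost_to_go[s], 1 + cost_to_go[s_next])

    if cost_to_go.get(start, inf) == inf:
        raise ValueError("goal is not reachable from start in the dataset")

    # Greedy policy with respect to the learned cost-to-go.
    path = [start]
    while path[-1] != goal:
        current = path[-1]
        path.append(min(sorted(successors[current]), key=cost_to_go.get))
    return path


stitched = stitch_shortest_path(trajectories, "A", "G")
print(stitched)
```

In this toy dataset the returned path takes the first transition of the second trajectory and the last two transitions of the first, so it is shorter than either trajectory on its own. Methods that only imitate whole trajectories cannot recover such a path, which is the point the caption makes.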