Explore to Generalize in Zero-Shot RL

Ev Zisselman, Itai Lavie, Daniel Soudry, Aviv Tamar

TL;DR

Zero-shot generalization in reinforcement learning remains challenging, with invariance-based methods failing on several ProcGen tasks. The authors propose Explore to Generalize (ExpGen), which combines a maximum-entropy exploration policy with an ensemble of reward-maximizing policies to handle test-time epistemic uncertainty through directed exploration. The key finding is that exploration-oriented behavior generalizes more robustly than reward memorization, enabling strong transfer to unseen levels and achieving state-of-the-art results on several ProcGen games (e.g., Maze 83%, Heist 74% with 200 training levels). ExpGen can be complementary to invariance-based approaches, yielding top performance when combined with IDAAC, and is supported by ablations that illuminate ensemble size, agreement, and meta-stability considerations. Overall, the work highlights exploration as a principled route to robust zero-shot RL generalization and provides practical guidance for balancing exploration and exploitation at test time.

Abstract

We study zero-shot generalization in reinforcement learning: optimizing a policy on a set of training tasks to perform well on a similar but unseen test task. To mitigate overfitting, previous work explored different notions of invariance to the task. However, on problems such as the ProcGen Maze, an adequate solution that is invariant to the task visualization does not exist, and therefore invariance-based approaches fail. Our insight is that learning a policy that effectively $\textit{explores}$ the domain is harder to memorize than a policy that maximizes reward for a specific task, and therefore we expect such learned behavior to generalize well; we indeed demonstrate this empirically on several domains that are difficult for invariance-based approaches. Our $\textit{Explore to Generalize}$ algorithm (ExpGen) builds on this insight: we train an additional ensemble of agents that optimize reward. At test time, either the ensemble agrees on an action, and we generalize well, or we take exploratory actions, which generalize well and drive us to a novel part of the state space, where the ensemble may potentially agree again. We show that our approach is state-of-the-art on tasks of the ProcGen challenge that have thus far eluded effective generalization, yielding a success rate of $83\%$ on the Maze task and $74\%$ on Heist with $200$ training levels. ExpGen can also be combined with an invariance-based approach to gain the best of both worlds, setting new state-of-the-art results on ProcGen.
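
The test-time rule described in the abstract (act on the ensemble's action when its members agree, otherwise take an exploratory action) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_action`, the vote threshold `min_votes`, and the use of a simple majority count as the agreement criterion are all assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence


def select_action(
    ensemble_actions: Sequence[int],
    explore_action: int,
    min_votes: int,
) -> int:
    """Pick an action at test time (hypothetical helper, not from the paper).

    ensemble_actions: one proposed action per reward-maximizing ensemble member.
    explore_action:   the action proposed by the exploration policy.
    min_votes:        assumed agreement threshold; how many members must
                      propose the same action for the ensemble to be trusted.
    """
    # Find the action proposed by the largest number of ensemble members.
    action, votes = Counter(ensemble_actions).most_common(1)[0]
    # If enough members agree, follow the ensemble; otherwise explore.
    return action if votes >= min_votes else explore_action


# Three of four members agree, so the ensemble action is used.
agreed = select_action([2, 2, 2, 1], explore_action=0, min_votes=3)
# No action reaches the threshold, so the exploratory action is used.
fallback = select_action([0, 1, 2, 3], explore_action=5, min_votes=3)
```

In this sketch a stricter `min_votes` makes the agent explore more often, which reflects the exploration/exploitation balance at test time mentioned in the TL;DR.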

Paper Structure (30 sections, 11 equations, 14 figures, 12 tables, 1 algorithm)

Figures (14)

  • Figure 1: (a)-(e) display screenshots of ProcGen games. (f) Imaginary maze with goal and walls removed (see text for explanation).
  • Figure 2: Normalized test performance for ExpGen, LEEP, IDAAC, DAAC, and PPO on five ProcGen games. ExpGen shows state-of-the-art performance on test levels of Maze, Heist and Jumper, games that are notoriously challenging for other leading approaches. The scores are normalized as proposed by Cobbe et al. (2020).
  • Figure 3: Example of a maxEnt trajectory on Maze. The policy visits every reachable state and averts termination by avoiding the goal state.
  • Figure 4: Generalization ability of maximum entropy vs. extrinsic reward: (a) Score of maximum entropy. (b) Score of extrinsic reward. Training for maximum entropy exhibits a small generalization gap in Maze, Jumper and Miner. Average and standard deviation are obtained using $4$ seeds.
  • Figure 5: Test performance of PPO trained using the reward $r_{total}$ that combines intrinsic and extrinsic rewards, weighted by $\beta$ (Eq. \ref{eq:r_total}). Each panel shows the results for a different value of the discount factor $\gamma$. All networks are randomly initialized and trained on $200$ maze levels, and their mean is computed over $4$ runs with different seeds. The figures show an improvement over the PPO baseline for $\gamma=0.5$. In all cases, ExpGen outperforms the combined reward agent.
  • ...and 9 more figures