EX2: Exploration with Exemplar Models for Deep Reinforcement Learning

Justin Fu, John D. Co-Reyes, Sergey Levine

TL;DR

This work tackles exploration in deep reinforcement learning under sparse rewards by introducing EX^2, a novelty-driven method that uses discriminatively trained exemplar models to estimate state densities implicitly, without training a generative model. It establishes a theoretical link between exemplar discriminators and density estimation, and shows how latent-space noise and suboptimal discriminators smooth the resulting estimates into practical densities that drive exploration. Two scalable architectures, the Amortized Multi-Exemplar model and the K-Exemplar model, are proposed and connected to GAN-style interpretations, enabling efficient training and generalization. Empirically, EX^2 matches or exceeds prior explicit-density methods on simple tasks and significantly outperforms them on challenging vizDoom tasks, demonstrating robust performance in high-dimensional image-based domains.

Abstract

Deep reinforcement learning algorithms have been shown to learn complex tasks using highly general policy classes. However, sparse reward problems remain a significant challenge. Exploration methods based on novelty detection have been particularly successful in such settings but typically require generative or predictive models of the observations, which can be difficult to train when the observations are very high-dimensional and complex, as in the case of raw images. We propose a novelty detection algorithm for exploration that is based entirely on discriminatively trained exemplar models, where classifiers are trained to discriminate each visited state against all others. Intuitively, novel states are easier to distinguish against other states seen during training. We show that this kind of discriminative modeling corresponds to implicit density estimation, and that it can be combined with count-based exploration to produce competitive results on a range of popular benchmark tasks, including state-of-the-art results on challenging egocentric observations in the vizDoom benchmark.

Paper Structure

This paper contains 29 sections, 4 theorems, 21 equations, 12 figures, 1 table.

Key Result

Proposition 1

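(Optimal Discriminator) For a discrete distribution $P_\mathcal{X}(x)$, the optimal discriminator $D_{x^*}$ for exemplar $x^*$ satisfies $D_{x^*}(x) = \frac{\delta_{x^*}(x)}{\delta_{x^*}(x) + P_\mathcal{X}(x)}$ and, at the exemplar itself, $D_{x^*}(x^*) = \frac{1}{1 + P_\mathcal{X}(x^*)}$.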
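
As a quick numerical check of this proposition, the following Python sketch assumes the discriminator maximizes an equal-weight objective $\log D(x^*) + \mathbb{E}_{x \sim P_\mathcal{X}}[\log(1 - D(x))]$ over a small discrete distribution with strictly positive probabilities. Under that assumption the optimum at the exemplar is $1 / (1 + P_\mathcal{X}(x^*))$, so the density can be read back as $(1 - D)/D$. The function names and the toy distribution are illustrative and are not taken from the paper's code.

```python
import numpy as np


def optimal_discriminator(p, exemplar):
    """Closed-form optimum delta / (delta + p) for a discrete distribution p.

    Assumes every entry of p is strictly positive, so the ratio is defined.
    """
    delta = np.zeros_like(p)
    delta[exemplar] = 1.0  # point mass on the exemplar
    return delta / (delta + p)


def density_from_discriminator(d_at_exemplar):
    """Invert D(x*) = 1 / (1 + p(x*)) to recover p(x*)."""
    return (1.0 - d_at_exemplar) / d_at_exemplar


def grid_search_optimum(p_star, num=100001):
    """Brute-force the maximizer of log d + p_star * log(1 - d) on a grid.

    Only the exemplar's own output matters here: the positive term touches
    D(x*) alone, and the negative term weights it by p_star.
    """
    grid = np.linspace(1e-6, 1.0 - 1e-6, num)
    objective = np.log(grid) + p_star * np.log(1.0 - grid)
    return grid[np.argmax(objective)]


# Toy distribution over three states (an assumption for illustration).
p = np.array([0.5, 0.3, 0.2])
exemplar = 1

d = optimal_discriminator(p, exemplar)
print("closed-form D(x*):", d[exemplar])
print("grid-search D(x*):", grid_search_optimum(p[exemplar]))
print("recovered p(x*):  ", density_from_discriminator(d[exemplar]))
```

For the toy distribution, the grid search and the closed form should agree closely at the exemplar, and inverting the discriminator output returns the exemplar's probability. A rarely visited state has small $P_\mathcal{X}(x^*)$, which pushes $D_{x^*}(x^*)$ toward 1; this matches the intuition that novel states are easier to distinguish.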

Figures (12)

  • Figure 1: A diagram of a) our amortized model architecture and b) the K-exemplar model architecture. Noise is injected after the encoder module (a) or after the shared layers (b). Although possible, we do not tie the encoders of (a) in our experiments.
  • Figure 2: a, b) Illustration of estimated densities on the 2D maze task produced by our model (a), compared to the empirical discretized distribution (b). Our method provides reasonable, somewhat smoothed density estimates. c) Density estimates produced with our implicit density estimator on a toy dataset (top left), with increasing amounts of noise regularization.
  • Figure 3: Example task images. From top to bottom, left to right: Doom, map of the MyWayHome task (goal is green, start is blue), Venture, HalfCheetah.
  • Figure 4: Illustrations of several tasks used in our experiments.
  • Figure 5: Top: 3 of the lowest scoring images on Venture early during training. These are typically pictures of the agent in the "overworld" where it spends most of its time. Bottom: 3 of the highest scoring images, which are typically when the agent enters one of the many rooms with reward. Images are grayscale due to preprocessing of the image.
  • ...and 7 more figures

Theorems & Definitions (8)

  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • Proposition 4
  • proof