Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration

Zakaria Mhammedi, James Cohan

Abstract

The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winner algorithm, paired with a measure of epistemic uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. Further, we demonstrate that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art scores by a wide margin on Montezuma's Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before for the Adroit tasks.

Paper Structure (78 sections, 5 equations, 6 figures, 5 tables, 2 algorithms)

Figures (6)

  • Figure 1: Schematic overview of GowU with a single group of particles. The algorithm maintains a population of particles ($p_1, p_2, p_3$) that explore the state space via multi-step rollouts. During an outer step, a particle that reaches a dead state (e.g., $p_2$ on level one) is pruned and its state is restored by a Reset to the winner, that is, the alive node maximizing accumulated reward, with epistemic uncertainty used as a tie-breaker. After $K$ outer steps, a Group Consolidation Reset syncs all particles, including alive ones, to the current winner.
  • Figure 2: Fully rendered observations from the three hard-exploration Atari games used in our evaluation.
  • Figure 3: MuJoCo continuous-control tasks used in our evaluation. (a-c) Adroit dexterous manipulation tasks using the 24-DoF ShadowHand. (d) AntMaze navigation task (top-down view).
  • Figure 4: Processed visual observations as seen by the agent for each MuJoCo task. Each column shows a frame in the observation stack, from the oldest (Frame $-3$) to the most recent (Frame $0$). For AntMaze, the "Frame top" column shows the global top-down view of the maze; during Phase I (exploration), only this top-down view is used. See \ref{sec:expsetup} for an overview and \ref{app:obs_processing} for full details on the observation processing pipeline.
  • Figure 5: Phase I exploration on Atari: GowU vs. Go-Explore (Ecoffet et al., 2021). Mean cumulative reward ($\pm$ std) across 100 seeds as a function of game frames. Go-Explore curves are approximated from Extended Data Fig. 2 in Ecoffet et al. (2021). GowU discovers high-scoring trajectories substantially faster across all three games.
  • ...and 1 more figures
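
The reset logic described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Particle` fields, the function names (`pick_winner`, `reset_dead`, `consolidate`, `outer_loop`), and the `rollout` callback are all hypothetical, the state is a plain integer standing in for a simulator snapshot, and the uncertainty score is taken as given rather than computed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Particle:
    state: int          # hypothetical stand-in for a simulator snapshot
    reward: float       # reward accumulated along this particle's trajectory
    uncertainty: float  # epistemic-uncertainty score (assumed to be given)
    alive: bool = True


def pick_winner(particles: List[Particle]) -> Particle:
    """Alive particle with the highest accumulated reward; uncertainty breaks ties."""
    alive = [p for p in particles if p.alive]
    if not alive:
        # Assumption: the caption does not say what happens if every particle dies.
        raise ValueError("no alive particle to reset to")
    return max(alive, key=lambda p: (p.reward, p.uncertainty))


def _copy_from(dst: Particle, src: Particle) -> None:
    """Overwrite dst with the winner's snapshot and mark it alive again."""
    dst.state, dst.reward, dst.uncertainty = src.state, src.reward, src.uncertainty
    dst.alive = True


def reset_dead(particles: List[Particle]) -> None:
    """Reset to the winner: only dead (pruned) particles are moved."""
    winner = pick_winner(particles)
    for p in particles:
        if not p.alive:
            _copy_from(p, winner)


def consolidate(particles: List[Particle]) -> None:
    """Group consolidation reset: every particle, alive or not, syncs to the winner."""
    winner = pick_winner(particles)
    for p in particles:
        if p is not winner:
            _copy_from(p, winner)


def outer_loop(
    particles: List[Particle],
    rollout: Callable[[Particle], None],
    num_outer_steps: int,
    K: int,
) -> None:
    """Run outer steps: multi-step rollout, prune-and-reset, consolidate every K steps."""
    for step in range(1, num_outer_steps + 1):
        for p in particles:
            rollout(p)  # hypothetical callback; may update state, reward, alive
        reset_dead(particles)
        if step % K == 0:
            consolidate(particles)
```

Comparing the tuple `(reward, uncertainty)` makes the tie-breaking explicit: uncertainty only matters when accumulated rewards are equal, which matches the caption's description of it as a tie-breaker.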

Theorems & Definitions (3)

  • Remark 1
  • Remark 2
  • Remark 3