Deep Curiosity Search: Intra-Life Exploration Can Improve Performance on Challenging Deep Reinforcement Learning Problems

Christopher Stanton, Jeff Clune

TL;DR

The paper introduces Deep Curiosity Search (DeepCS), an intra-life exploration method for deep reinforcement learning that rewards an agent for visiting new tiles of a discretized curiosity grid within each episode; the grid is reset when the agent loses all lives and starts a new game. By blending these intrinsic rewards with extrinsic environment rewards and using a multi-actor A2C training regime, DeepCS matches the average performance of state-of-the-art methods on Montezuma's Revenge, nearly doubles median A2C performance on Seaquest, and improves exploration on Amidar, Freeway, Gravitar, and Tutankham. Ablation studies indicate that the intrinsic rewards are the key component: removing the curiosity grid input leaves Seaquest performance unchanged and lowers performance on Montezuma's Revenge, which still remains better than A2C alone. Combining intra-life and across-training novelty is proposed as future work. The results suggest that intra-life novelty is a viable, complementary exploration signal with potential for hybrid strategies in deep RL.

Abstract

Traditional exploration methods in RL require agents to perform random actions to find rewards. But these approaches struggle on sparse-reward domains like Montezuma's Revenge where the probability that any random action sequence leads to reward is extremely low. Recent algorithms have performed well on such tasks by encouraging agents to visit new states or perform new actions in relation to all prior training episodes (which we call across-training novelty). But such algorithms do not consider whether an agent exhibits intra-life novelty: doing something new within the current episode, regardless of whether those behaviors have been performed in previous episodes. We hypothesize that across-training novelty might discourage agents from revisiting initially non-rewarding states that could become important stepping stones later in training. We introduce Deep Curiosity Search (DeepCS), which encourages intra-life exploration by rewarding agents for visiting as many different states as possible within each episode, and show that DeepCS matches the performance of current state-of-the-art methods on Montezuma's Revenge. We further show that DeepCS improves exploration on Amidar, Freeway, Gravitar, and Tutankham (many of which are hard exploration games). Surprisingly, DeepCS doubles A2C performance on Seaquest, a game we would not have expected to benefit from intra-life exploration because the arena is small and already easily navigated by naive exploration techniques. In one run, DeepCS achieves a maximum training score of 80,000 points on Seaquest, higher than any methods other than Ape-X. The strong performance of DeepCS on these sparse- and dense-reward tasks suggests that encouraging intra-life novelty is an interesting, new approach for improving performance in Deep RL and motivates further research into hybridizing across-training and intra-life exploration methods.
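
The distinction between across-training and intra-life novelty can be illustrated with a small sketch. This is not the authors' implementation: the class names, the `1/sqrt(count)` bonus (a common count-based choice, used here only as a stand-in for across-training methods), and the toy string states are all assumptions made for illustration. It shows that a bonus tied to all prior episodes shrinks when the same path is repeated, while an intra-life bonus pays the same amount in every episode.

```python
from collections import Counter
from math import sqrt


class AcrossTrainingBonus:
    """Count-based bonus whose visit counts persist over all episodes (assumed form)."""

    def __init__(self):
        self.counts = Counter()

    def new_episode(self):
        pass  # counts are deliberately kept between episodes

    def reward(self, state):
        self.counts[state] += 1
        return 1.0 / sqrt(self.counts[state])


class IntraLifeBonus:
    """Bonus for the first visit to a state within the current episode only."""

    def __init__(self):
        self.seen = set()

    def new_episode(self):
        self.seen.clear()  # memory is wiped at the start of every episode

    def reward(self, state):
        if state in self.seen:
            return 0.0
        self.seen.add(state)
        return 1.0


def run(bonus, episodes):
    """Return the total bonus collected in each episode."""
    totals = []
    for episode in episodes:
        bonus.new_episode()
        totals.append(sum(bonus.reward(s) for s in episode))
    return totals


# Hypothetical data: the same three-state path repeated over four episodes.
episodes = [["a", "b", "c"]] * 4
across_totals = run(AcrossTrainingBonus(), episodes)
intra_totals = run(IntraLifeBonus(), episodes)
```

In this toy setting `across_totals` decreases from one episode to the next, whereas `intra_totals` stays constant, so revisiting an initially non-rewarding path is still encouraged under the intra-life signal.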

Paper Structure

This paper contains 10 sections, 9 figures, 1 table.

Figures (9)

  • Figure 1: DeepCS encourages agents to visit new places in Montezuma's Revenge. White sections in the curiosity grid (middle) show which locations have been visited; the unvisited black sections yield an exploration bonus when touched. The network receives both game input (left) and curiosity grid (middle) and must learn how to form a map of where the agent has been (hypothetical illustration, right). The grid is reset when the agent loses all lives and starts a new game, encouraging intra-life exploration irrespective of previous games.
  • Figure 2: DeepCS improves performance on sparse- and dense-reward Atari games. In Montezuma's Revenge (a challenging sparse-reward game), DeepCS vastly outperforms the naïve exploration of A2C alone and matches the average performance of state-of-the-art methods (Table 1, DeepCS vs. PC and PCn). On Seaquest (an easier, dense-reward game), DeepCS nearly doubles the median performance of A2C (3443 vs. 1791 points) and obtains nearly 80,000 points in one run. The rightmost plots show intrinsic rewards, quantifying how much of the game world has been explored; horizontal bars below each plot indicate statistical significance. For additional games, see the SI figure on other Atari games.
  • Figure 3: DeepCS produces agents that explore a large percentage of Montezuma's Revenge. The game starts in the upper-center room; filled sections indicate explored rooms and blank areas are unexplored. The best DeepCS agent explores 15 rooms (right), matching state-of-the-art techniques [bellemare:pseudo]. Most algorithms barely explore 1-2 rooms of this very difficult game. Agents that move west from the initial room (left) seem rare in DeepCS (3 out of 25 runs) and in the literature.
  • Figure 4: We did not expect DeepCS to help Seaquest because the curiosity grid can quickly saturate. At game start, the grid is unfilled and many intrinsic rewards can be obtained (left). However, after only 700 game frames (middle), the grid is nearly saturated; DeepCS can only provide training feedback for a few hard-to-reach tiles. However, even this brief presence of intrinsic rewards allows agents to learn behaviors affording them e.g. 76,000 points on Seaquest in a single run (right).
  • Figure 5: DeepCS does not need the curiosity grid input to improve domain exploration. When the curiosity grid has been removed (No Grid), agents perform equally well on Seaquest; performance drops on MR but remains better than A2C alone. Both games suffer when intrinsic rewards are removed (No Intrinsic), suggesting that intrinsic rewards (not the grid) are the key aspect of DeepCS.
  • ...and 4 more figures
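
The curiosity grid described in the Figure 1 and Figure 4 captions can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's code: the `CuriosityGrid` class, the 16-pixel tile size, the bonus of 1.0, the `weight` parameter, and the idea that the agent's `(x, y)` pixel position is directly available are all hypothetical. Only the 210x160 Atari frame size, the first-visit bonus, the reset at the start of a new game, and the grid saturation effect come from the text or from common knowledge of the platform.

```python
import numpy as np


class CuriosityGrid:
    """Boolean grid over the game screen; unvisited tiles pay a one-time bonus."""

    def __init__(self, frame_hw=(210, 160), tile=16, bonus=1.0):
        # tile size and bonus are assumed values, not taken from the paper
        self.tile = tile
        self.bonus = bonus
        rows = -(-frame_hw[0] // tile)  # ceiling division
        cols = -(-frame_hw[1] // tile)
        self.visited = np.zeros((rows, cols), dtype=bool)

    def reset(self):
        """Clear the grid; called when all lives are lost and a new game starts."""
        self.visited[:] = False

    def step(self, x, y):
        """Mark the tile under pixel position (x, y) and return the intrinsic reward."""
        row, col = int(y) // self.tile, int(x) // self.tile
        if self.visited[row, col]:
            return 0.0
        self.visited[row, col] = True
        return self.bonus

    def saturation(self):
        """Fraction of tiles visited so far in the current game."""
        return float(self.visited.mean())

    def as_input(self):
        """Grid as a float array, e.g. to feed alongside the game frame."""
        return self.visited.astype(np.float32)


def shaped_reward(extrinsic, intrinsic, weight=1.0):
    """Blend the environment reward with the curiosity bonus (weight is assumed)."""
    return extrinsic + weight * intrinsic


grid = CuriosityGrid()
first = grid.step(5, 5)        # new tile
repeat = grid.step(10, 12)     # same tile as before
neighbor = grid.step(20, 5)    # adjacent tile to the right
grid.reset()
after_reset = grid.step(5, 5)  # tile is new again after the reset
```

Because each tile pays out only once per game, `saturation()` approaches 1.0 as the agent covers the screen and the intrinsic signal fades, which is the effect the Figure 4 caption describes for Seaquest.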