An investigation of model-free planning

Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sébastien Racanière, Théophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, Greg Wayne, David Silver, Timothy Lillicrap

TL;DR

The paper shows that a purely model-free, high-capacity neural network can exhibit planning-like behavior in challenging combinatorial domains, without explicit planning structures. It introduces the DRC architecture, a stack of ConvLSTM memory modules with iterative processing and encoder-driven inputs, trained with model-free RL to produce policy and value, and demonstrates strong generalization, data efficiency, and improved performance with added computation time. Across Sokoban, Boxworld, MiniPacman, Gridworld, and planning-focused Atari tasks, DRC achieves state-of-the-art results and often outperforms specialized planning baselines, suggesting that planning can emerge from general-purpose networks. These findings imply that combining flexible architectures with planning-oriented biases could yield even more powerful agents in complex environments.
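To make the "stack of memory modules with iterative processing" idea concrete, here is a minimal sketch of the control flow of a DRC(D, N)-style core: D stacked recurrent cells applied N times ("ticks") per environment step, with the same parameters reused on every tick and the encoded observation fed to every layer. Everything named here is hypothetical and for illustration only: `make_cell` is a tiny tanh layer standing in for a ConvLSTM module, `drc_step` is not the authors' code, and feeding the top layer's previous-tick state into the bottom layer is an assumption about the wiring rather than something stated on this page.

```python
import math
import random


def make_cell(size, rng):
    """Hypothetical stand-in for one ConvLSTM module (NOT a real ConvLSTM).

    Returns a function mapping (encoding, input from below, own state)
    to a new state, using a single tanh layer with fixed random weights.
    """
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(3 * size)]
               for _ in range(size)]

    def cell(encoding, below, state):
        x = encoding + below + state  # list concatenation, length 3 * size
        return [math.tanh(sum(w * v for w, v in zip(row, x)))
                for row in weights]

    return cell


def drc_step(cells, states, encoding, num_ticks):
    """Run the D-layer stack num_ticks times for one environment step.

    cells:     list of D cell functions (parameters shared across ticks)
    states:    list of D state vectors carried over from the previous step
    encoding:  encoded observation, fed to every layer on every tick
    """
    for _ in range(num_ticks):
        # Assumption: the bottom layer sees the top layer's previous state.
        below = states[-1]
        new_states = []
        for cell, state in zip(cells, states):
            new_states.append(cell(encoding, below, state))
            below = new_states[-1]  # the next layer up reads this output
        states = new_states
    return states


if __name__ == "__main__":
    rng = random.Random(0)
    depth, ticks, size = 2, 3, 4  # a DRC(2, 3)-shaped toy
    cells = [make_cell(size, rng) for _ in range(depth)]
    states = [[0.0] * size for _ in range(depth)]
    for observation in ([0.1, -0.2, 0.3, 0.4], [0.0, 0.5, -0.5, 0.2]):
        states = drc_step(cells, states, observation, ticks)
    print(len(states), len(states[0]))
```

In this sketch, raising `num_ticks` adds computation per step without adding parameters, which is the kind of knob the "added computation time" claim refers to. A policy and value head would read the final top-layer state; that part is omitted here.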

Abstract

The field of reinforcement learning (RL) is facing increasingly challenging domains with combinatorial complexity. For an RL agent to address these challenges, it is essential that it can plan effectively. Prior work has typically utilized an explicit model of the environment, combined with a specific planning algorithm (such as tree search). More recently, a new family of methods has been proposed that learns how to plan by providing the structure for planning via an inductive bias in the function approximator (such as a tree-structured neural network), trained end-to-end by a model-free RL algorithm. In this paper, we go even further and demonstrate empirically that an entirely model-free approach, without special structure beyond standard neural network components such as convolutional networks and LSTMs, can learn to exhibit many of the characteristics typically associated with a model-based planner. We measure our agent's effectiveness at planning in terms of its ability to generalize across a combinatorial and irreversible state space, its data efficiency, and its ability to utilize additional thinking time. We find that our agent has many of the characteristics that one might expect to find in a planning algorithm. Furthermore, it exceeds the state of the art in challenging combinatorial domains such as Sokoban and outperforms other model-free approaches that utilize strong inductive biases toward planning.

Paper Structure

This paper contains 45 sections, 15 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: Illustration of the agent's network architecture. This diagram shows DRC(2,3) for two time steps. Square boxes denote ConvLSTM modules and the rectangle box represents an MLP. Boxes with the same color share parameters.
  • Figure 2: Examples of Sokoban levels from the (a) unfiltered, (b) medium test sets, and from the (c) hard set. Our best model is able to solve all three levels.
  • Figure 3: a) Learning curves for various configurations of DRC in Sokoban-Unfiltered. b) Comparison with other network architectures tuned for Sokoban. Results are on test-set levels.
  • Figure 4: Comparison of DRC(3,3) (Top, Large network) and DRC(1,1) (Bottom, Small network) when trained with RL on various train set sizes (subsets of the Sokoban-unfiltered training set). Left column shows the performance on levels from the corresponding train set, right column shows the performance on the test set (the same set across these experiments).
  • Figure 5: Generalization results from models trained on different training set sizes (Large, Medium and Small subsets of the unfiltered training dataset) in Sokoban when evaluated on (a) the unfiltered test set and (b) the medium-difficulty test set. (c) Similar generalization results for trained models in Boxworld. (These panels summarize the results in Figure 4 and in the corresponding Boxworld figure in the Appendix.)
  • ...and 10 more figures