Rapid Task-Solving in Novel Environments

Sam Ritter, Ryan Faulkner, Laurent Sartran, Adam Santoro, Matt Botvinick, David Raposo

TL;DR

Rapid Task-Solving in Novel Environments (RTS) asks how an agent can solve tasks quickly in an unfamiliar environment by learning to explore, remember, and plan within a single episode. The authors introduce Episodic Planning Networks (EPNs), which apply iterated self-attention over episodic memories and learn a value-iteration-like planning process that adapts to new environments. They evaluate EPNs on two domains, the Memory&Planning Game and One-Shot StreetLearn, and report that EPN agents outperform the baseline agents, generalize to larger maps and a held-out city, and improve with additional planning iterations. The work points toward agents that can rapidly reason and act in novel environments, with potential relevance to real-world robotics and autonomous navigation.

Abstract

We propose the challenge of rapid task-solving in novel environments (RTS), wherein an agent must solve a series of tasks as rapidly as possible in an unfamiliar environment. An effective RTS agent must balance between exploring the unfamiliar environment and solving its current task, all while building a model of the new environment over which it can plan when faced with later tasks. While modern deep RL agents exhibit some of these abilities in isolation, none are suitable for the full RTS challenge. To enable progress toward RTS, we introduce two challenge domains: (1) a minimal RTS challenge called the Memory&Planning Game and (2) One-Shot StreetLearn Navigation, which introduces scale and complexity from real-world data. We demonstrate that state-of-the-art deep RL agents fail at RTS in both domains, and that this failure is due to an inability to plan over gathered knowledge. We develop Episodic Planning Networks (EPNs) and show that deep-RL agents with EPNs excel at RTS, outperforming the nearest baseline by factors of 2-3 and learning to navigate held-out StreetLearn maps within a single episode. We show that EPNs learn to execute a value iteration-like planning algorithm and that they generalize to situations beyond their training experience.
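
To make the phrase "value iteration-like planning" concrete, here is a minimal sketch of classical value iteration on a small deterministic graph. It is not the EPN, whose planning computation is learned rather than hand-coded. The chain-shaped graph, the reward of 1 for reaching the goal, the discount `gamma`, and the function names are all assumptions chosen for illustration. The point it shows is that each sweep carries value information one step further from the goal, so more iterations allow planning over longer distances.

```python
def value_iteration(neighbors, goal, num_iters, gamma=0.9):
    """Run synchronous value-iteration sweeps on a deterministic graph.

    neighbors: dict mapping each state to the states reachable in one step.
    Stepping into the goal yields reward 1; the goal is treated as terminal.
    (Illustrative setup, not taken from the paper.)
    """
    values = {state: 0.0 for state in neighbors}
    for _ in range(num_iters):
        new_values = {}
        for state in neighbors:
            if state == goal:
                new_values[state] = 0.0  # terminal state has no future return
                continue
            # Bellman backup: best one-step reward plus discounted next value.
            new_values[state] = max(
                1.0 if nxt == goal else gamma * values[nxt]
                for nxt in neighbors[state]
            )
        values = new_values
    return values


def greedy_next_state(neighbors, values, state, goal, gamma=0.9):
    """Pick the neighbor with the highest one-step backed-up value."""
    return max(
        neighbors[state],
        key=lambda nxt: 1.0 if nxt == goal else gamma * values[nxt],
    )


# Hypothetical 5-state chain: 0 - 1 - 2 - 3 - 4, with the goal at state 4.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

for sweeps in (1, 2, 4):
    values = value_iteration(chain, goal=4, num_iters=sweeps)
    informed = [s for s in chain if values[s] > 0.0]
    print(f"{sweeps} sweep(s): states with nonzero value = {informed}")
```

After one sweep only the state adjacent to the goal has a nonzero value; after four sweeps every non-goal state in the chain does, and following `greedy_next_state` from any of them leads toward the goal.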

Paper Structure

This paper contains 16 sections, 2 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: (a) Rapid Task Solving in Novel Environments (RTS) setup. A new environment is sampled in every episode. Each episode consists of a sequence of tasks which are defined by sampling a new goal state and a new initial state. The agent has a fixed number of steps per episode to complete as many tasks as possible. (b) Episodic Planning Network (EPN) architecture. The EPN uses multiple iterations of a single shared self-attention function over memories retrieved from an episodic storage.
  • Figure 2: Memory&Planning Game. (a) Example $4\!\times\!4$ environment (not observable by the agent) and state-goal observation. (b) Training curves. Performance measured by the average reward per episode, which corresponds to the average number of tasks completed within a 100-step episode (showing the best runs from a large hyper-parameter sweep for each model). (c) Performance measured in the last third of the episodes (post-training), relative to an oracle with perfect information that takes the shortest path to the goal. (d) Example trajectory of a trained EPN agent in the first three tasks of an episode. In the first task, the agent explores optimally without repeating states. In the subsequent tasks, the agent takes the shortest path to the goal. (e) Number of steps taken by an agent to reach the nth goal in an episode.
  • Figure 3: One-Shot StreetLearn. (a) Four example states from two randomly sampled neighborhoods. (b) Example connectivity graphs. (c) Evaluation performance measured on neighborhoods of a held-out city throughout the course of training (showing the best run from a large hyper-parameter sweep for each model). (d) Performance in last third of episode relative to an oracle with perfect information. (e) Number of steps taken by an agent to reach the nth goal in an episode.
  • Figure 4: Iteration analysis and generalization. (a) Performance of an EPN agent on held-out neighborhoods with 5 intersections using planners with 1, 2 and 4 self-attention iterations (showing 3 runs for each condition). (b) Performance on new neighborhoods that were larger (7 and 9 intersections) than the ones used during training (5 intersections). (c) Distance-to-goal decoding accuracy from the output of the planner after 1 to 6 iterations. (d) The ability of EPN activations to predict distance to goal expands out from the goal state (blue arrow) as the number of self-attention iterations increases. See section 6.2 for details.
  • Figure 5: Comparison between architecture variants. The Nxk architecture variant, which scales linearly with the total number of memories, recovers $92\%$ of the performance of the A2A variant, which scales quadratically.
  • ...and 1 more figure
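
The Figure 1(b) caption describes the EPN as applying multiple iterations of a single shared self-attention function over retrieved memories. The sketch below illustrates only that structural idea in plain Python. It assumes identity query/key/value projections, a residual connection, and no normalization, none of which is taken from the paper; the names `self_attention` and `plan` are hypothetical. Because every memory attends to every other memory, each pass computes one score per pair, which is the quadratic cost the Figure 5 caption attributes to the A2A variant.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(memories):
    """One all-to-all self-attention pass over a list of memory vectors.

    Assumption: queries, keys, and values are the memories themselves
    (identity projections); a trained model would use learned projections.
    """
    dim = len(memories[0])
    scale = math.sqrt(dim)
    updated = []
    for query in memories:
        # One scaled dot-product score per (query, key) pair: O(N^2) overall.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale for key in memories
        ]
        weights = softmax(scores)
        attended = [
            sum(w * value[d] for w, value in zip(weights, memories))
            for d in range(dim)
        ]
        # Residual connection (an assumption): keep the original memory content.
        updated.append([x + a for x, a in zip(query, attended)])
    return updated


def plan(memories, num_iters):
    """Apply the same self-attention function num_iters times."""
    state = memories
    for _ in range(num_iters):
        state = self_attention(state)  # shared function, reused each iteration
    return state


# Three toy 2-dimensional memories (made-up values for illustration).
toy_memories = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
planned = plan(toy_memories, num_iters=2)
print(len(planned), "memories of dimension", len(planned[0]))
```

Reusing one function across iterations means the number of planning iterations can be changed without adding parameters, which is what makes a comparison across 1, 2, and 4 iterations (Figure 4a) straightforward to set up.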