When Remembering and Planning are Worth it: Navigating under Change

Omid Madani; J. Brian Burns; Reza Eghbali; Thomas L. Dean

TL;DR

Handling tasks of different natures requires an architecture that can incorporate multiple strategies: in particular, exploration and search when the food location is not known, and planning a good path to a remembered (likely) food location.

Abstract

We explore how different types and uses of memory can aid spatial navigation in changing, uncertain environments. In the simple foraging task we study, every day our agent has to find its way from its home, through barriers, to food. Moreover, the world is non-stationary: from day to day, the locations of the barriers and food may change, and the agent's sensing, such as its location information, is uncertain and very limited. Any model construction, such as a map, and any use of it, such as planning, needs to be robust against these challenges, and if any learning is to be useful, it needs to be adequately fast. We look at a range of strategies, from simple to sophisticated, with various uses of memory and learning. We find that an architecture that can incorporate multiple strategies is required to handle (sub)tasks of a different nature, in particular exploration and search when the food location is not known, and planning a good path to a remembered (likely) food location. An agent that uses non-stationary probability learning techniques to keep updating its (episodic) memories, and that uses those memories to build maps and plan on the fly, can be increasingly and substantially more efficient than the simpler (minimal-memory) agents as task difficulty, such as distance to goal, is raised. These maps are imperfect, i.e., noisy and limited to the agent's experience, and the advantage holds as long as the uncertainty from localization and change is not too large.

Paper Structure (36 sections, 9 figures, 6 tables)

Introduction
The Environment, the Agent, and Task(s)
The Task(s): Getting to Food
Change
Limited Sensing (Observing, Localizing, ..)
Agent Structure
Agent's Scheduling Logic and Progressive Time Budgets
Agent's Interfacing with the Strategies
Bypassing and Support for a Strategy Hierarchy
Strategies
Random Strategies
Greedy Strategies
Least-Visited (medium-term, or a day's, memory)
Path-Memory (longer, over-days, memory)
Relation to Model-Free RL
...and 21 more sections

Figures (9)

Figure 1: (a) The agent and its environment. It is important to emphasize that the agent does not see the whole grid, just the locations immediately adjacent to its current location (partial observability). (b) Two knobs on task complexity: barrier proportion and environment size (distance to goal). (c) A third knob on task difficulty: rate of (barrier) change from day to day.
Figure 2: Basic actions and sensing: (a) Four possible actions: LEFT (west), RIGHT, UP, or DOWN. (b) In this example, with two barriers, the agent has two legal actions (left and down). (c) Motion noise, up to six possibilities: when intending to go east (right), with some (noise) probability the agent may end up in another location: stay in the same cell, go up, down, or left, or go two hops east. (d) Sensing is also from a single adjacent cell (four such cells).
Figure 3: The control loop of a multi-strategy agent, responsible for the agent's daily activity. Each strategy has to provide an action-selection function, but other functions are optional. See Sect. \ref{sec:budgets} on how the agent changes its strategies (to find or reach the goal), and Sect. \ref{sec:fns} for descriptions of the different functions.
Figure 4: In this paper, an agent can use several strategies, or behavior modes, in a fixed round-robin order (Sect. \ref{sec:budgets}): (a) Each day the agent begins with the first strategy. It moves to the next strategy (wrapping around) when the currently active strategy fails or the strategy's time is up, until the goal is reached (or all strategies fail). The allotted times are doubled each time it starts the list over in that day. (b) An example composite agent: the Greedy+Biased (mixed-greedy) agent begins with the greedy strategy and transitions to (biased) random when greedy fails, and repeats this loop (each time doubling the time allotted to random) until food is found.
Figure 5: A summary of what different strategies use or require (mainly of the agent, but also of the environment). Greedy requires the smell ('gradient') direction. Localization, i.e., availability of the $(\tilde{x},\tilde{y})$ estimate of the agent's current location, need not be perfect (Sect. \ref{sec:sense}). DQN's long-term memory is in its neural-network weights, and its input vector includes the current $(\tilde{x},\tilde{y})$ in our experiments.
...and 4 more figures
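
The round-robin scheduling described in the Figure 4 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Strategy`, `run_day`, `base_budget`, `max_steps`, and the outcome labels are hypothetical, and the sketch assumes that each strategy step reports one of three outcomes and that a failing step still consumes one time step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Outcome labels a strategy reports after one step (hypothetical names).
CONTINUE, FAILED, GOAL = "continue", "failed", "goal"


@dataclass
class Strategy:
    name: str
    step: Callable[[], str]  # performs one action and returns an outcome label
    base_budget: int         # time steps allotted on the first pass of the day


def run_day(strategies: List[Strategy],
            max_steps: int = 10_000) -> Tuple[bool, int, List[Tuple[str, int]]]:
    """Cycle through the strategies in fixed order, doubling budgets per pass.

    Returns (goal_reached, steps_used, log), where log records the budget
    granted at each strategy activation.
    """
    steps_used = 0
    scale = 1  # multiplier applied to every base budget; doubles each pass
    log: List[Tuple[str, int]] = []
    while steps_used < max_steps:
        all_failed = True
        for s in strategies:
            budget = s.base_budget * scale
            log.append((s.name, budget))
            outcome = CONTINUE
            for _ in range(budget):
                if steps_used >= max_steps:
                    break
                outcome = s.step()
                steps_used += 1
                if outcome != CONTINUE:
                    break  # strategy failed or reached the goal
            if outcome == GOAL:
                return True, steps_used, log
            if outcome != FAILED:
                all_failed = False  # this strategy merely ran out of time
        if all_failed:
            return False, steps_used, log  # every strategy failed in one pass
        scale *= 2  # starting the list over: double the allotted times
    return False, steps_used, log
```

With a greedy strategy that fails immediately and a random strategy given a base budget of 2, the random strategy would be granted 2, then 4, then 8 steps on successive passes, mirroring the Greedy+Biased example in panel (b).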
