Thinking agents for zero-shot generalization to qualitatively novel tasks

Thomas Miconi, Kevin McKee, Yicong Zheng, Jed McCaleb

TL;DR

This work tackles qualitative generalization by training agents to think, i.e., to internally simulate and evaluate trajectories in order to solve truly novel tasks without on-task learning. It does so by withholding a specific combination of environment elements during training, which creates a test task that is solvable only through internal thinking, and by evolving the training tasks to maximize the difference between pre-thinking and post-thinking performance. The agent uses a separate world model to imagine futures, with thinking trials conducted in the actual environment during training and via the world model at test time; results show emergent thinking capabilities and task performance driven by imagined outcomes. The approach demonstrates that controlled withholding of element combinations can foster genuine zero-shot qualitative generalization, and it identifies the world-model predictor as a key bottleneck, suggesting directions for more efficient thinking-based planning.

Abstract

Intelligent organisms can solve truly novel problems which they have never encountered before, either in their lifetime or their evolution. An important component of this capacity is the ability to "think", that is, to mentally manipulate objects, concepts and behaviors in order to plan and evaluate possible solutions to novel problems, even without environment interaction. To generate problems that are truly qualitatively novel, while still solvable zero-shot (by mental simulation), we use the combinatorial nature of environments: we train the agent while withholding a specific combination of the environment's elements. The novel test task, based on this combination, is thus guaranteed to be truly novel, while still mentally simulable since the agent has been exposed to each individual element (and their pairwise interactions) during training. We propose a method to train agents endowed with world models to make use of their mental simulation abilities, by selecting tasks based on the difference between the agent's pre-thinking and post-thinking performance. When tested on the novel, withheld problem, the resulting agent successfully simulated alternative scenarios and used the resulting information to guide its behavior in the actual environment, solving the novel task in a single real-environment trial (zero-shot).
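
The selection criterion described in the abstract (favoring tasks where post-thinking performance exceeds pre-thinking performance) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(pre, post)` score layout, and the top-`k` selection rule are assumptions made for illustration only; the abstract states the criterion but not how it is applied.

```python
from typing import Dict, List, Tuple


def thinking_gain(pre: float, post: float) -> float:
    """Improvement attributable to thinking: post-thinking minus pre-thinking reward."""
    return post - pre


def select_tasks(scores: Dict[str, Tuple[float, float]], k: int) -> List[str]:
    """Return the k task names with the largest thinking gain.

    `scores` maps a (hypothetical) task name to a (pre, post) reward pair.
    Ranking by top-k is an assumption; the source only says tasks are
    selected based on the pre/post-thinking performance difference.
    """
    ranked = sorted(
        scores,
        key=lambda name: thinking_gain(*scores[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical reward pairs, for illustration only.
example_scores = {
    "task_a": (0.2, 0.9),  # thinking helps a lot
    "task_b": (0.5, 0.5),  # thinking makes no difference
    "task_c": (0.1, 0.4),  # thinking helps somewhat
}
selected = select_tasks(example_scores, k=2)
```

Under this sketch, tasks that the agent already solves without thinking, or cannot solve even with thinking, receive a gain near zero and are deprioritized.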

Paper Structure

This paper contains 17 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Environment tasks (levels). The agent is represented as '@'. Left: The test task, based on a combination of Zombies (Z), zombie-killing angels (A) and killing/digging blocks (X). Right: Random training task. Notice the presence of Angels (A) and digging blocks (X), but the absence of Zombies (Z). No training task can contain all three of Z, A and X.
  • Figure 2: Total reward on the novel test task, for trial 1 (first 'thinking' trial), and trial 4 (single real-environment trial). Gray shaded area represents the pre-training period on a fixed set of hand-defined tasks; training after this includes randomly generated, selected tasks.
  • Figure 3: Test task: proportion of test episodes in which the angel door is opened, in a batch of 5000 test episodes with the fully trained agent. For 'thinking' episodes (gray shaded area), this is assessed by running the internally generated actions in a copy of the actual environment (the agent's perceived observations during 'thinking' are always internally generated from its own world model, with no interaction with the actual environment).
  • Figure A1: Proportion of test episodes with the angel door down, but replacing the agent's world model with the true environment simulator (conventions are as in Figure 3).
  • Figure A2: Same as Figure A1, but passing the observations through the agent's latent encoder. High performance suggests that most of the loss when thinking in the world model (compare Figure 3) comes from the next-latent predictor, rather than from information loss in the latent.
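
The constraint stated in the Figure 1 caption (no training task may contain all three of Z, A and X) could be enforced with a simple filter when generating training tasks. The sketch below is a hypothetical illustration, not the paper's code: the function name, the representation of a task as a set of element symbols, and the extra symbol "W" are assumptions.

```python
from typing import Iterable

# The withheld combination from the Figure 1 caption:
# Zombies (Z), Angels (A), and killing/digging blocks (X).
WITHHELD_COMBINATION = frozenset({"Z", "A", "X"})


def is_allowed_training_task(elements: Iterable[str]) -> bool:
    """Return True unless the task contains every element of the withheld combination.

    Individual elements and pairs remain allowed, so the agent can still
    encounter each element and each pairwise interaction during training.
    """
    return not WITHHELD_COMBINATION.issubset(set(elements))


# Candidate tasks represented as sets of element symbols (illustrative only;
# "W" is a made-up placeholder for any other element).
candidates = [
    {"A", "X"},            # allowed: no zombies
    {"Z", "A"},            # allowed: no digging blocks
    {"Z", "A", "X", "W"},  # rejected: contains the full withheld combination
]
training_tasks = [task for task in candidates if is_allowed_training_task(task)]
```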