When in Doubt, Think Slow: Iterative Reasoning with Latent Imagination

Martin Benfeghoul, Umais Zahid, Qinghai Guo, Zafeirios Fountas

TL;DR

This work tackles the limitation of model-based RL in unfamiliar environments by introducing a training-free Iterative Inference (II) mechanism that operates at decision-time using latent imagination. By performing gradient-based updates to the agent’s latent state over imagined future rollouts and backpropagating through the policy, II refines state representations to maximize future model evidence, using objectives such as $L_{ELBO}$, $L_{SIG}$, and $L_{PIG}$ (with a regularising term $L_{Reg}$). Empirically, II improves reconstruction quality and task performance across visual 3D navigation tasks, with larger benefits in partially observable environments and for agents with less pre-evaluation training. These findings suggest II as a practical, training-free method to enhance system-2-like reasoning in model-based RL, particularly when new data is scarce or expensive.
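The decision-time refinement described above can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps the learned DreamerV3 world model for hypothetical linear latent dynamics (`A`, `B`) and a hypothetical linear policy (`K`), and replaces $L_{ELBO}$, $L_{SIG}$ and $L_{PIG}$ with a toy Gaussian log-score of the imagined states around an assumed mean `mu`. Only the overall shape is meant to carry over: roll the latent state forward in imagination through the policy, score the imagined states, regularise towards the initially inferred state, and take gradient steps on the latent state while all model weights stay fixed.

```python
import numpy as np

def rollout(z, M, lam):
    """Imagine lam future latent states under closed-loop dynamics z_{k+1} = M z_k."""
    states = []
    for _ in range(lam):
        z = M @ z
        states.append(z)
    return states

def objective(z, z_init, M, mu, lam, beta):
    """Toy surrogate (an assumption, not the paper's objectives): Gaussian
    log-score of the imagined states, minus a regulariser keeping z near z_init."""
    score = sum(-0.5 * np.sum((zk - mu) ** 2) for zk in rollout(z, M, lam))
    return score - beta * np.sum((z - z_init) ** 2)

def objective_grad(z, z_init, M, mu, lam, beta):
    """Analytic gradient of `objective` w.r.t. z. Because M = A + B K contains
    the policy gain K, the gradient flows through the policy as well."""
    grad = -2.0 * beta * (z - z_init)
    Mk = np.eye(len(z))
    zk = z
    for _ in range(lam):
        Mk = M @ Mk              # M^k
        zk = M @ zk              # k-th imagined state
        grad += Mk.T @ (mu - zk)
    return grad

def iterative_inference(z_init, M, mu, lam=3, beta=0.5, lr=0.05, n_steps=20):
    """Refine the latent state by gradient ascent; no model weights are updated."""
    z = z_init.copy()
    for _ in range(n_steps):
        z = z + lr * objective_grad(z, z_init, M, mu, lam, beta)
    return z

# Hypothetical toy model: latent dynamics A, action matrix B, linear policy K.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [0.1]])
K = np.array([[-0.5, -0.2]])
M = A + B @ K                    # closed-loop transition under the policy

z_init = np.array([1.0, -1.0])   # latent state as initially inferred
mu = np.array([0.5, 0.5])        # assumed mean of the toy score
z_refined = iterative_inference(z_init, M, mu)
before = objective(z_init, z_init, M, mu, lam=3, beta=0.5)
after = objective(z_refined, z_init, M, mu, lam=3, beta=0.5)
```

In this sketch `lam` plays the role of the rollout length $\lambda$ and `beta` weights the regulariser; `lr` and `n_steps` are illustrative values chosen so that plain gradient ascent stays stable on the toy problem.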

Abstract

In an unfamiliar setting, a model-based reinforcement learning agent can be limited by the accuracy of its world model. In this work, we present a novel, training-free approach to improving the performance of such agents separately from planning and learning. We do so by applying iterative inference at decision-time, to fine-tune the inferred agent states based on the coherence of future state representations. Our approach achieves a consistent improvement in both reconstruction accuracy and task performance when applied to visual 3D navigation tasks. We go on to show that considering more future states further improves the performance of the agent in partially-observable environments, but not in a fully-observable one. Finally, we demonstrate that agents with less training pre-evaluation benefit most from our approach.

Paper Structure (37 sections, 9 equations, 24 figures, 1 table, 1 algorithm)

Figures (24)

  • Figure 1: Comparison of the average performance metrics for iterative inference with two inference objectives versus the default DreamerV3 agent (Baseline). For each objective, we take the results of the rollout length $\lambda$ with the best significant improvement over the baseline (i.e. a p-value below 5%) across all metrics, assuming the same value as the baseline if the result isn't significant. Experiments are run over 100 episodes each.
  • Figure 2: Immediate improvement on the current latent state reconstruction metrics when the State IG objective is used with rollout length $\lambda=1$. Values and standard deviation are measured over all environment steps for 100 episodes each.
  • Figure 3: Comparison of the impact of different rollout lengths on the inference objective and reconstruction metrics over environment steps. All plots show II applied with the State IG objective to DMLab 2 after 600K pre-training steps, compared to the baseline agent. Values are averaged across 100 episodes.
  • Figure 4: DMLab's Collect Good Objects task. https://github.com/google-deepmind/lab/blob/master/game_scripts/levels/contributed/dmlab30/README.md.
  • Figure 5: DMLab's NatLab Fixed Large Map task. https://www.youtube.com/watch?v=ucJEnnn5iC8.
  • ...and 19 more figures