Spatial Reasoning and Planning for Deep Embodied Agents

Shu Ishida

TL;DR

This thesis tackles spatial reasoning and planning for embodied agents using data-driven methods aimed at transferability and sample efficiency. It introduces CALVIN, a differentiable planner that learns transition and reward models from expert demonstrations to navigate unseen 3D spaces, with an augmented transition model enforcing legal actions and a 3D pose representation for embodied planning. It also develops SOAP, an unsupervised option-discovery method with forward-backward and policy-gradient objectives to enable long-horizon behaviour, alongside LangProp, a code-optimisation framework that leverages large language models to generate and iteratively refine executable policies, demonstrated in Sudoku, CartPole, and CARLA driving. Finally, Voggite demonstrates staged task execution in Minecraft by decomposing complex tasks into sequential stages via memory-like options and transformer backbones, achieving competitive performance in MineRL BASALT. Collectively, the work advances learnable planning, reusable skills, and interpretable, data-driven decision-making for embodied agents across navigation, driving, and virtual-world tasks, while outlining practical limitations and directions for robust, scalable deployment.

Abstract

Humans can perform complex tasks with long-term objectives by planning, reasoning, and forecasting the outcomes of actions. For embodied agents to achieve similar capabilities, they must acquire knowledge of the environment that transfers to novel scenarios with a limited budget of additional trial and error. Learning-based approaches, such as deep RL, can discover and exploit inherent regularities and characteristics of the application domain from data and continuously improve their performance, but at the cost of large amounts of training data. This thesis explores the development of data-driven techniques for spatial reasoning and planning tasks, focusing on enhancing learning efficiency, interpretability, and transferability across novel scenarios. Four key contributions are made. 1) CALVIN, a differentiable planner that learns interpretable models of the world for long-term planning. It successfully navigated partially observable 3D environments, such as mazes and indoor rooms, by learning the rewards and state transitions from expert demonstrations. 2) SOAP, an RL algorithm that discovers options in an unsupervised manner for long-horizon tasks. Options segment a task into subtasks and enable consistent execution of each subtask. SOAP showed robust performance on history-conditional corridor tasks as well as classical benchmarks such as Atari. 3) LangProp, a code optimisation framework that uses LLMs to solve embodied agent problems requiring reasoning, by treating code as learnable policies. The framework successfully generated interpretable code with comparable or superior performance to human-written experts in the CARLA autonomous driving benchmark. 4) Voggite, an embodied agent with a vision-to-action transformer backend that solves complex tasks in Minecraft. It achieved third place in the MineRL BASALT Competition by identifying action triggers to segment tasks into multiple stages.

Paper Structure

This paper contains 208 sections, 60 equations, 30 figures, 11 tables.

Figures (30)

  • Figure 1: (1st column) Input images seen during a run of CALVIN on AVD (\ref{sec:calvin/experiments/avd}). This embodied neural network has learned to efficiently explore and navigate unseen indoor environments, to seek objects of a given class (highlighted in the last image). (2nd-3rd columns) Predicted rewards and values, respectively, for each spatial location (brighter is higher). The unknown optimal trajectory is dashed, while the robot's trajectory is solid.
  • Figure 2: (left) A 2D maze, with the target in yellow. (middle) Values produced by the VIN for each 2D state (actions are taken towards the highest value). Higher values are brighter. The correct trajectory is dashed, and the current one is solid. The agent (orange circle) is stuck due to the local maximum below it. (right) The same values for CALVIN. There are no spurious maxima, and the values of walls are correctly considered low (dark).
  • Figure 3: Example rollout of CALVIN after $21$ steps (left column), $43$ steps (middle column) and $65$ steps (right column). CALVIN successfully terminated at $65$ steps. (top row) Input visualisation: unexplored cells are dark, and the discovered target is yellow. The correct trajectory is dashed, and the current one is solid. The orange circle shows the position of the agent. (bottom row) Predicted values (higher values are brighter). Explored cells have low values, while unexplored cells and the discovered target are assigned high values.
  • Figure 4: CALVIN's learnt rewards and values on partially observable 2D mazes with embodied navigation (\ref{sec:calvin/experiments/embodied}). (left) Input visualisation: unexplored cells are dark, the target is yellow (just found by the agent), and a black arrow shows the agent's position and orientation. (middle) Close-up of predicted rewards (higher values are brighter) inside the white rectangle of the left panel. The 3D state space (position/orientation) is shown, with rewards for the 8 orientations in a radial pattern within each cell (position). Explored cells have low rewards, with the highest reward at the target. (right) Close-up of predicted values. They are higher when facing the direction of the target. Obstacles (black border) have low values.
  • Figure 5: Example results on MiniWorld (\ref{sec:calvin/experiments/miniworld}). Left to right: input images, predicted rewards and values. The format is as in \ref{fig:calvin/avd}. Notice the high reward on unexplored regions, replaced with a single peak around the target when it is seen (last row).
  • ...and 25 more figures
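
The value maps described in the captions above (values per grid cell, with actions taken towards the highest value and walls kept low) can be illustrated with plain tabular value iteration on a small 2D maze. This is only a minimal sketch under stated assumptions, not CALVIN or VIN: the rewards and transitions here are fixed by hand rather than learned from demonstrations, and the maze layout, the constants `GAMMA`, `STEP_COST` and `GOAL_REWARD`, and the helper names `find`, `candidates`, `value_iteration` and `greedy_rollout` are all hypothetical.

```python
# Illustrative sketch: tabular value iteration on a hand-made 2D maze.
# Rewards and transitions are hand-specified (an assumption); CALVIN learns them.

MAZE = [
    "#######",
    "#A..#.#",
    "#.#.#.#",
    "#.#...#",
    "#.###.#",
    "#....G#",
    "#######",
]
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
GAMMA = 0.95        # assumed discount factor
STEP_COST = -0.01   # assumed small penalty per move
GOAL_REWARD = 1.0   # assumed reward for stepping onto the target


def find(maze, ch):
    """Return the (row, col) of the first cell containing `ch`."""
    for r, row in enumerate(maze):
        c = row.find(ch)
        if c >= 0:
            return (r, c)
    raise ValueError(f"{ch!r} not found in maze")


def candidates(maze, values, goal, r, c):
    """List (q_value, next_cell) for every legal move out of cell (r, c)."""
    out = []
    for dr, dc in MOVES:
        nr, nc = r + dr, c + dc
        if maze[nr][nc] == "#":
            continue  # moving into a wall is not a legal action
        reward = GOAL_REWARD if (nr, nc) == goal else STEP_COST
        out.append((reward + GAMMA * values[nr][nc], (nr, nc)))
    return out


def value_iteration(maze, iters=100):
    """Synchronous Bellman backups; walls and the terminal goal stay at 0."""
    goal = find(maze, "G")
    rows, cols = len(maze), len(maze[0])
    values = [[0.0] * cols for _ in range(rows)]
    for _ in range(iters):
        new = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                if maze[r][c] == "#" or (r, c) == goal:
                    continue
                qs = candidates(maze, values, goal, r, c)
                if qs:
                    new[r][c] = max(q for q, _ in qs)
        values = new
    return values


def greedy_rollout(maze, values, max_steps=50):
    """Follow the highest-valued legal action from 'A' until 'G' is reached."""
    pos, goal = find(maze, "A"), find(maze, "G")
    path = [pos]
    while pos != goal and len(path) <= max_steps:
        qs = candidates(maze, values, goal, *pos)
        if not qs:
            break  # no legal move from this cell
        pos = max(qs, key=lambda item: item[0])[1]
        path.append(pos)
    return path


if __name__ == "__main__":
    vals = value_iteration(MAZE)
    for row in vals:
        print(" ".join(f"{v:5.2f}" for v in row))
    print("greedy path:", greedy_rollout(MAZE, vals))
```

Because every open cell's value is backed up from its best legal neighbour, the converged map in this toy setting has no spurious local maxima and the greedy rollout cannot get stuck. The stuck agent in the middle panel of Figure 2 shows what happens when a learned value map lacks that property.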