A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution

Valts Blukis, Chris Paxton, Dieter Fox, Animesh Garg, Yoav Artzi

TL;DR

This work introduces a persistent spatial semantic representation in a hierarchical language-conditioned model (HLSM) to bridge high-level natural language instructions and long-horizon mobile manipulation. The approach maintains a 3D semantic voxel map with occupancy and observability, shared by a high-level subgoal planner and a low-level action executor, enabling robust long-horizon reasoning without relying on detailed step-by-step instructions. Trained with supervised learning on ALFRED data, the method achieves state-of-the-art results on seen and unseen environments and demonstrates the value of persistent world memory for grounding language in action. The findings highlight practical implications for scalable, language-driven robot control and point to future work in reinforcement learning integration and real-world deployment challenges.
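The map described above can be pictured roughly as follows. This is a minimal sketch, not the HLSM implementation: the class name `SemanticVoxelMap`, the grid shape, the voxel size, and the hard one-hot overwrite rule are all assumptions made for illustration, and the inputs are assumed to be 3D points already expressed in the map's world frame.

```python
import numpy as np


class SemanticVoxelMap:
    """Hypothetical persistent map: per-voxel class scores, occupancy, observability."""

    def __init__(self, shape=(8, 8, 4), num_classes=3, voxel_size=0.25):
        self.voxel_size = voxel_size
        self.semantics = np.zeros((num_classes, *shape), dtype=np.float32)
        self.occupied = np.zeros(shape, dtype=bool)
        self.observed = np.zeros(shape, dtype=bool)

    def _index(self, points):
        # Convert metric points (N, 3) to integer voxel indices and an in-bounds mask.
        points = np.asarray(points, dtype=float).reshape(-1, 3)
        idx = np.floor(points / self.voxel_size).astype(int)
        ok = np.all((idx >= 0) & (idx < np.array(self.occupied.shape)), axis=1)
        return idx, ok

    def update(self, surface_points, class_ids, free_points):
        """Integrate one observation; voxels not touched here keep their old state."""
        # Voxels seen to be empty: observed, not occupied, no class.
        f, ok = self._index(free_points)
        fx, fy, fz = f[ok].T
        self.observed[fx, fy, fz] = True
        self.occupied[fx, fy, fz] = False
        self.semantics[:, fx, fy, fz] = 0.0

        # Voxels containing an observed surface: observed, occupied, one-hot class.
        s, ok = self._index(surface_points)
        sx, sy, sz = s[ok].T
        labels = np.asarray(class_ids, dtype=int)[ok]
        self.observed[sx, sy, sz] = True
        self.occupied[sx, sy, sz] = True
        self.semantics[:, sx, sy, sz] = 0.0
        self.semantics[labels, sx, sy, sz] = 1.0


voxel_map = SemanticVoxelMap()
# Step 1: a class-1 surface is seen, with free space in front of it.
voxel_map.update([[0.3, 0.3, 0.3]], [1], [[0.1, 0.1, 0.1]])
# Step 2: the agent looks elsewhere; the earlier voxel is no longer in view.
voxel_map.update([[1.1, 0.3, 0.3]], [2], [])
```

Because the second update never touches the first voxel, its class and occupancy survive after it leaves the field of view. That persistence is what lets a high-level planner and a low-level executor consult the same state at any later step.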

Abstract

Natural language provides an accessible and expressive interface to specify long-term tasks for robotic agents. However, non-experts are likely to specify such tasks with high-level instructions, which abstract over specific robot actions through several layers of abstraction. We propose that key to bridging this gap between language and robot actions over long execution horizons are persistent representations. We propose a persistent spatial semantic representation method, and show how it enables building an agent that performs hierarchical reasoning to effectively execute long-term tasks. We evaluate our approach on the ALFRED benchmark and achieve state-of-the-art results, despite completely avoiding the commonly used step-by-step instructions.

Paper Structure

This paper contains 40 sections, 10 equations, 9 figures, 5 tables, 2 algorithms.

Figures (9)

  • Figure 1: Illustration of the task and our hierarchical formulation. The agent receives a high-level task in natural language. It needs to map RGB images to navigation and manipulation actions to complete the task.
  • Figure 2: Model architecture consisting of an observation model, a high-level controller ($\pi^{H}$), and a low-level controller ($\pi^{L}$). The observation model updates the semantic voxel map state representation from RGB observations. $\pi^{H}$ predicts the next subgoal given the instruction and the map. $\pi^{L}$ outputs a sequence of actions to achieve the subgoal. The semantic voxel map is visualized in the middle, with the agent position illustrated as a black pillar and the current subgoal argument mask in yellow. Other colors are different segmentation classes. Saturated voxels are observed in the current timestep.
  • Figure 2: Development results on the validation split. Performance of our full approach, with perception oracles, a perception ablation, $\pi^{H}$ ablations, and $\pi^{L}$ ablations.
  • Figure 3: Illustration of the high-level controller $\pi^{H}$ (Section \ref{sec:model:hlp}).
  • Figure 4: Qualitative results showcasing successes and failures of our approach. Top row: snapshots of every interaction action taken during a successful task. Action argument masks are overlaid in red over the RGB images. The white numbers are timesteps. Middle-right: illustration of a non-fatal perception error. Middle-left: illustration of a fatal perception error. The agent incorrectly interprets the reflection on the alarm clock as an obstacle, causing the agent (blue star) to believe that the path to the goal (green star) is blocked off. This is reflected in the navigation value function computed by the value iteration network (VIN; Tamar et al., 2016), where black cells are obstacles with value $-1$ and the white cell is the goal with value $1$. Bottom-left: grounding failure. The agent wrongly picks up the cup instead of a bowl. Predicted subgoals are shown in green. Bottom-right: high-level controller and perception failure. $\pi^{H}$ predicts the wrong subgoal argument class (CD instead of Egg). The segmentation model then mistakes the vase for a CD.
  • ...and 4 more figures
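
The fatal perception error in the Figure 4 caption can be reproduced on a toy grid. The sketch below uses plain tabular value iteration as a stand-in for the learned VIN, so it is not the paper's model. The function name `navigation_values`, the discount factor, the 4-connected moves, and the grids are illustrative assumptions. Only the conventions that obstacles have value $-1$ and the goal has value $1$ come from the caption.

```python
import numpy as np


def navigation_values(obstacles, goal, discount=0.9, iterations=50):
    """Value iteration on a 2D grid: obstacles fixed at -1, goal fixed at +1."""
    obstacles = np.asarray(obstacles, dtype=bool)
    values = np.zeros(obstacles.shape)
    for _ in range(iterations):
        # Pad with -1 so the area outside the grid behaves like a wall.
        p = np.pad(values, 1, constant_values=-1.0)
        neighbors = np.stack([p[:-2, 1:-1], p[2:, 1:-1], p[1:-1, :-2], p[1:-1, 2:]])
        # A free cell takes the discounted value of its best neighbor, never below 0.
        values = discount * np.maximum(neighbors.max(axis=0), 0.0)
        values[obstacles] = -1.0
        values[goal] = 1.0
    return values


agent, goal = (1, 0), (1, 4)

# True layout: a wall in column 2 with a gap in the top row.
true_obstacles = np.zeros((3, 5), dtype=bool)
true_obstacles[1:, 2] = True
open_values = navigation_values(true_obstacles, goal)

# Perceived layout: a spurious obstacle (for example a misread reflection) closes the gap.
perceived_obstacles = true_obstacles.copy()
perceived_obstacles[0, 2] = True
blocked_values = navigation_values(perceived_obstacles, goal)
```

With the gap open, the agent's cell has a positive value that decays with path length, so following increasing values leads to the goal. With the spurious obstacle, the agent's cell stays at zero: no path exists in the agent's map even though one exists in the world.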