A Persistent Spatial Semantic Representation for High-level Natural Language Instruction Execution
Valts Blukis, Chris Paxton, Dieter Fox, Animesh Garg, Yoav Artzi
TL;DR
This work introduces a persistent spatial semantic representation in a hierarchical language-conditioned model (HLSM) to bridge high-level natural language instructions and long-horizon mobile manipulation. The approach maintains a 3D semantic voxel map with occupancy and observability, shared by a high-level subgoal planner and a low-level action executor, enabling robust long-horizon reasoning without relying on detailed step-by-step instructions. Trained with supervised learning on ALFRED data, the method achieves state-of-the-art results on seen and unseen environments and demonstrates the value of persistent world memory for grounding language in action. The findings highlight practical implications for scalable, language-driven robot control and point to future work in reinforcement learning integration and real-world deployment challenges.
Abstract
Natural language provides an accessible and expressive interface to specify long-term tasks for robotic agents. However, non-experts are likely to specify such tasks with high-level instructions, which abstract over specific robot actions through several layers of abstraction. We propose that key to bridging this gap between language and robot actions over long execution horizons are persistent representations. We propose a persistent spatial semantic representation method, and show how it enables building an agent that performs hierarchical reasoning to effectively execute long-term tasks. We evaluate our approach on the ALFRED benchmark and achieve state-of-the-art results, despite completely avoiding the commonly used step-by-step instructions.
