Verifiably Following Complex Robot Instructions with Foundation Models
Benedict Quartey, Eric Rosen, Stefanie Tellex, George Konidaris
TL;DR
This paper tackles verifiable instruction following for mobile robots in unstructured environments by translating natural language commands into Linear Temporal Logic (LTL) specifications and grounding open-vocabulary referents in a dynamically generated 3D semantic map. It introduces LIMP, a modular pipeline that (i) translates language into LTL with composable referent descriptors, (ii) grounds referents via a vision-language grounding module to form a Referent Semantic Map, and (iii) synthesizes a verifiable task and motion plan using a Progressive Motion Planner and Task Progression Semantic Maps. The approach enables correct-by-construction behavior on long-horizon tasks with complex spatiotemporal constraints. In an evaluation of 150 instructions across five real-world environments, LIMP performs comparably to state-of-the-art baselines on standard open-vocabulary tasks and outperforms them on instructions with complex temporal constraints (CT/CST), reaching a 79% success rate on complex spatiotemporal instructions against 38% for the baselines. Limitations include dependence on the accuracy of the vision-language model (VLM), the assumption of a static environment, and restriction to co-safe formulas; these point to future work on dynamic scene handling and plan optimality. Overall, LIMP demonstrates that combining foundation models with symbolic verification and task and motion planning (TAMP) yields reliable, open-world instruction following without prebuilt semantic maps.
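To make the idea of verifying behavior against a co-safe LTL specification concrete, the following is a minimal sketch, not LIMP's implementation. It assumes finite-trace semantics, a hypothetical tuple encoding of formulas, and made-up proposition names (bookshelf, desk, couch) standing in for grounded referents. The example specification encodes an instruction such as "visit the bookshelf and then the desk, and stay away from the couch until you have reached the bookshelf".

```python
from typing import FrozenSet, Sequence, Union

# Hypothetical encoding (an assumption for illustration): a formula is either an
# atomic proposition name or a tuple (operator, operand, ...).
Formula = Union[str, tuple]
# A trace is a finite sequence of steps; each step is the set of propositions
# that hold at that step (e.g. which referent regions the robot is in).
Trace = Sequence[FrozenSet[str]]


def holds(formula: Formula, trace: Trace, i: int = 0) -> bool:
    """Check a formula at position i of a finite trace (finite-trace semantics)."""
    if i >= len(trace):
        return False  # nothing can be witnessed past the end of the trace
    if isinstance(formula, str):
        return formula in trace[i]
    op = formula[0]
    if op == "not":
        # Negation is restricted to atomic propositions (positive normal form).
        if not isinstance(formula[1], str):
            raise ValueError("negation is only supported on atomic propositions")
        return formula[1] not in trace[i]
    if op == "and":
        return holds(formula[1], trace, i) and holds(formula[2], trace, i)
    if op == "or":
        return holds(formula[1], trace, i) or holds(formula[2], trace, i)
    if op == "next":
        return holds(formula[1], trace, i + 1)
    if op == "eventually":
        return any(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "until":
        # a U b: b must hold at some step j >= i, with a holding at every step before j.
        for j in range(i, len(trace)):
            if holds(formula[2], trace, j):
                return True
            if not holds(formula[1], trace, j):
                return False
        return False
    raise ValueError(f"unknown operator: {op!r}")


# F(bookshelf & F desk) & (!couch U bookshelf)
SPEC = (
    "and",
    ("eventually", ("and", "bookshelf", ("eventually", "desk"))),
    ("until", ("not", "couch"), "bookshelf"),
)

good = [frozenset(), frozenset({"bookshelf"}), frozenset({"couch"}), frozenset({"desk"})]
bad = [frozenset({"couch"}), frozenset({"bookshelf"}), frozenset({"desk"})]

print(holds(SPEC, good))  # True: couch is entered only after the bookshelf
print(holds(SPEC, bad))   # False: couch is entered before the bookshelf
```

Because a co-safe formula is satisfied by a finite prefix, a check of this kind can accept an executed or planned trajectory as soon as a satisfying prefix has been observed. A planner would typically compile the formula to an automaton instead of re-evaluating it recursively.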
Abstract
When instructing robots, users want to flexibly express constraints, refer to arbitrary landmarks, and verify robot behavior, while robots must disambiguate instructions into specifications and ground instruction referents in the real world. To address this problem, we propose Language Instruction grounding for Motion Planning (LIMP), an approach that enables robots to verifiably follow complex, open-ended instructions in real-world environments without prebuilt semantic maps. LIMP constructs a symbolic instruction representation that reveals the robot's alignment with an instructor's intended motives and affords the synthesis of correct-by-construction robot behaviors. We conduct a large-scale evaluation of LIMP on 150 instructions across five real-world environments, demonstrating its versatility and ease of deployment in diverse, unstructured domains. LIMP performs comparably to state-of-the-art baselines on standard open-vocabulary tasks and additionally achieves a 79% success rate on complex spatiotemporal instructions, significantly outperforming baselines that only reach 38%. See supplementary materials and demo videos at https://robotlimp.github.io
