Verifiably Following Complex Robot Instructions with Foundation Models

Benedict Quartey, Eric Rosen, Stefanie Tellex, George Konidaris

TL;DR

This paper tackles verifiable instruction following for mobile robots in unstructured environments by translating natural language commands into Linear Temporal Logic (LTL) specifications and grounding open‑vocabulary referents in a dynamically generated 3D semantic map. It introduces LIMP, a modular pipeline that (i) translates language into LTL with composable referent descriptors, (ii) grounds referents via a Vision‑Language grounding module to form a Referent Semantic Map, and (iii) synthesizes a verifiable Task and Motion Plan using a Progressive Motion Planner and Task Progression Semantic Maps. The approach enables correct‑by‑construction behavior for long‑horizon tasks with complex spatiotemporal constraints and outperforms state‑of‑the‑art baselines on large real‑world evaluations, with strong results on complex temporal constraints (CT/CST). Limitations include dependence on VLM accuracy, static environments, and restriction to co‑safe formulas, pointing to future work in dynamic scene handling and optimality improvements. Overall, LIMP demonstrates that combining foundation models with symbolic verification and TAMP yields reliable, open‑world instruction following without prebuilt semantic maps.
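To make the role of co-safe LTL specifications concrete, the following is a minimal sketch of how a finite robot trace can be checked against such a formula using a finite-state automaton. It is not LIMP's implementation: the automaton is hand-built rather than compiled from a formula, and the propositions `a` (at the pickup object), `b` (at the destination), and `c` (near a region to avoid) are hypothetical names chosen for illustration. The formula encoded is (¬c U a) ∧ F(a ∧ F b), that is, avoid c until a is reached, then eventually reach b.

```python
from typing import Iterable, Set

# Automaton states for the illustrative formula (!c U a) & F(a & F b).
# All proposition and state names are assumptions made for this sketch.
TRAP, WAIT_A, WAIT_B, ACCEPT = -1, 0, 1, 2


def step(state: int, label: Set[str]) -> int:
    """Advance the automaton by one observation (the set of true propositions)."""
    if state == WAIT_A:
        if "a" in label:
            # 'a' reached; if 'b' also holds at this step the task is already done.
            return ACCEPT if "b" in label else WAIT_B
        # Before 'a' holds, seeing 'c' violates the avoidance constraint.
        return TRAP if "c" in label else WAIT_A
    if state == WAIT_B:
        return ACCEPT if "b" in label else WAIT_B
    # ACCEPT and TRAP are absorbing: a co-safe formula, once satisfied by a
    # finite prefix, stays satisfied by every extension of that prefix.
    return state


def satisfies(trace: Iterable[Set[str]]) -> bool:
    """Return True if the finite trace drives the automaton to ACCEPT."""
    state = WAIT_A
    for label in trace:
        state = step(state, label)
    return state == ACCEPT
```

Because acceptance is decided by a finite prefix, a planner can treat "reach the accepting state" as its goal and check any candidate plan against the same automaton before execution.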

Abstract

When instructing robots, users want to flexibly express constraints, refer to arbitrary landmarks, and verify robot behavior, while robots must disambiguate instructions into specifications and ground instruction referents in the real world. To address this problem, we propose Language Instruction grounding for Motion Planning (LIMP), an approach that enables robots to verifiably follow complex, open-ended instructions in real-world environments without prebuilt semantic maps. LIMP constructs a symbolic instruction representation that reveals the robot's alignment with an instructor's intended motives and affords the synthesis of correct-by-construction robot behaviors. We conduct a large-scale evaluation of LIMP on 150 instructions across five real-world environments, demonstrating its versatility and ease of deployment in diverse, unstructured domains. LIMP performs comparably to state-of-the-art baselines on standard open-vocabulary tasks and additionally achieves a 79% success rate on complex spatiotemporal instructions, significantly outperforming baselines that only reach 38%. See supplementary materials and demo videos at https://robotlimp.github.io

Paper Structure (25 sections, 4 equations, 7 figures, 2 tables, 2 algorithms)

This paper contains 25 sections, 4 equations, 7 figures, 2 tables, 2 algorithms.

Figures (7)

  • Figure 1: Our approach executing the instruction "Bring the green plush toy to the whiteboard in front of it, watch out for the robot in front of the toy". The robot dynamically detects and grounds open-vocabulary referents with spatial constraints to construct an instruction-specific semantic map, then synthesizes a task and motion plan to solve the task. In this example, the robot navigates from its start location (yellow, A), to the green plush toy (green, B), executes a pick skill then navigates to the whiteboard (blue, C), and executes a place skill. Note that the robot has no prior semantic knowledge of the environment.
  • Figure 2: [A] LIMP translates natural language instructions into temporal logic expressions, where open-vocabulary referents are applied to predicates that correspond to robot skills; note the context-aware resolution of the phrase "blue one" to the referent "blue_sofa". [B] Vision-language models detect referents, while spatial reasoning disambiguates referent instances to generate a 3D semantic map that localizes instruction-specific referents. [C] Finally, the temporal logic expression is compiled into a finite-state automaton, which a task and motion planner uses with dynamically-generated task progression semantic maps to progressively identify goals and constraints in the environment, and generate a plan that satisfies the high-level task specification.
  • Figure 3: An instruction is first translated into a conventional LTL formula $\phi_l$ that loosely captures the desired temporal occurrence of referent objects, then into our LTL syntax $\varphi_l$ with predicate functions that temporally chain required robot skills parameterized by composable referent descriptors.
  • Figure 4: [A] Our spatial grounding module leverages a VLM to detect all referent occurrences from prior observations of the environment. [B] An initial semantic map with all detected referent instances is generated by backprojecting pixels in segmented referent masks onto the 3D map. [C] Each referent’s spatial comparators are resolved with respect to the origin coordinate frame of reference. [D] Failing instances are filtered out to obtain a Referent Semantic Map (RSM) that localizes the exact referent instances described in the instruction.
  • Figure 5: [A] A given instruction translated into our LTL syntax $\varphi_l$ can be compiled into an equivalent finite-state automaton that captures the temporal constraints of the task. A path through this automaton is selected with a strategy that incrementally picks the next progression state from the initial state to the accepting state. The robot then executes the manipulation options and navigation behaviors dictated by this high-level task plan. [B] To execute navigation objectives our approach generates a task progression semantic map (TPSM) that augments the environment with state transition constraints, localizing goal (yellow) and avoidance (red) regions. Generated TPSMs are converted into 2D obstacle maps for constraint-aware continuous path planning.
  • ...and 2 more figures
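
The backprojection step named in the Figure 4 caption can be illustrated with the standard pinhole camera model. This is a hedged sketch rather than the authors' code: the function names, the nested-list image representation, and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions made for this example, and the resulting points are in the camera frame (placing them in a map frame would additionally require the camera pose, which is omitted here).

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def backproject_mask(
    mask: Sequence[Sequence[int]],     # 1 where the pixel belongs to the referent
    depth: Sequence[Sequence[float]],  # depth in meters, 0 where invalid
    fx: float, fy: float,              # focal lengths in pixels (assumed known)
    cx: float, cy: float,              # principal point in pixels (assumed known)
) -> List[Point3D]:
    """Lift masked pixels with valid depth to 3D points in the camera frame."""
    points: List[Point3D] = []
    for v, (mask_row, depth_row) in enumerate(zip(mask, depth)):
        for u, (inside, z) in enumerate(zip(mask_row, depth_row)):
            if inside and z > 0:
                # Pinhole model: pixel (u, v) at depth z maps to (x, y, z).
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                points.append((x, y, z))
    return points


def centroid(points: Sequence[Point3D]) -> Point3D:
    """Mean of a non-empty point set, usable as a single referent location."""
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )
```

Reducing each referent's point set to a centroid is one simple way to obtain a position on which spatial relations between referents could then be evaluated; the source does not state which representation LIMP uses for this.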