CARTIER: Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots

Dmitriy Rivkin, Nikhil Kakodkar, Francois Hogan, Bobak H. Baghi, Gregory Dudek

TL;DR

The paper addresses enabling natural, conversational instruction execution for robot navigation by grounding LLM reasoning in a scene-aware representation. It introduces CARTIER, a five-stage pipeline that detects objects, builds a spatial language index, and uses LLMs to map ambiguous user queries to concrete navigation targets within AI2Thor, then validates against baselines with both perceptual and localization metrics. CARTIER substantially improves localization accuracy, particularly for long, conversational queries, and demonstrates feasibility with a real-world telepresence deployment. The work advances open-vocabulary, language-grounded robotics by integrating object-centric scene representations with LLM reasoning, while highlighting detector limitations and avenues for hybrid approaches. Overall, CARTIER enables more natural and effective instruction execution for robot navigation in everyday environments.

Abstract

This work explores the capacity of large language models (LLMs) to address problems at the intersection of spatial planning and natural language interfaces for navigation. We focus on following complex instructions that are more akin to natural conversation than the explicit procedural directives typically seen in robotics. Unlike most prior work, where navigation directives are provided as simple imperative commands (e.g., "go to the fridge"), we examine implicit directives obtained through conversational interactions. We leverage the 3D simulator AI2Thor to create household query scenarios at scale, and augment it by adding complex language queries for 40 object types. We demonstrate that a robot using our method CARTIER (Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots) can parse descriptive language queries up to 42% more reliably than existing LLM-enabled methods by exploiting the ability of LLMs to interpret the user interaction in the context of the objects in the scenario.
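
The core idea of interpreting a conversational query in the context of the objects in the scene can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Detection` schema, the prompt wording, the function names, and the `fake_llm` stand-in are all assumptions made for illustration, and a real system would call an actual chat model and use a real object detector.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class Detection:
    """Hypothetical detector output: a label and the centre of its bounding box."""
    label: str
    center: Point


def build_spatial_index(detections: List[Detection]) -> Dict[str, List[Point]]:
    """Map each (lower-cased) object label to the locations where it was seen."""
    index: Dict[str, List[Point]] = {}
    for det in detections:
        index.setdefault(det.label.lower(), []).append(det.center)
    return index


def build_prompt(user_text: str, labels: List[str]) -> str:
    """Combine the user's utterance with the list of objects in the scene.

    The wording is an assumption, not the prompt used in the paper.
    """
    return (
        "Objects in the scene: " + ", ".join(sorted(labels)) + "\n"
        "User said: " + user_text + "\n"
        "Reply with the single object from the list the robot should go to."
    )


def resolve_target(
    user_text: str,
    index: Dict[str, List[Point]],
    llm: Callable[[str], str],
) -> Optional[Tuple[str, Point]]:
    """Ask the LLM for a target object, then look up its location in the index."""
    answer = llm(build_prompt(user_text, list(index))).strip().lower()
    if answer not in index:
        return None  # the LLM named something that was never detected
    return answer, index[answer][0]  # first known location of that object


def fake_llm(prompt: str) -> str:
    """Stand-in for a real chat model call (assumption, keyword-based)."""
    return "CoffeeMachine" if "tired" in prompt else "Fridge"


detections = [
    Detection("Fridge", (1.0, 0.0, 2.0)),
    Detection("CoffeeMachine", (3.0, 0.9, 0.5)),
]
index = build_spatial_index(detections)
result = resolve_target("I'm so tired this morning", index, fake_llm)
print(result)
```

Keeping the LLM behind a plain callable makes it easy to swap the stub for a real model, and returning `None` when the answer is not in the index makes failures of the detector or the LLM explicit rather than silently navigating to a wrong location.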

Paper Structure (18 sections, 5 figures, 1 table)

This paper contains 18 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: CARTIER prompts an LLM with knowledge about a robot's environment in order to parse user intent from implicit, conversational queries. It then informs the robot where to navigate in order to help the user.
  • Figure 2: Overview of CARTIER. Blue boxes indicate processes, while black boxes indicate data. 1. The robot explores its environment during a preprocessing stage, and a trajectory of poses and RGBD frames is collected. 2. In the next stage of preprocessing, an object detector is used to determine which objects are present in the scene. In addition, a spatial language index is built to support querying for locations given an object. We explore two such indices: VLMaps [vlmaps] and a simpler solution that uses the bounding boxes returned by the object detector. 3. User input text is combined with a list of objects in the scene to produce an engineered prompt for the LLM. 4. The LLM is queried with the constructed prompt. We use ChatGPT and GPT-4 [instructgpt] in our experiments. 5. The spatial language index is used to look up the location of the object.
  • Figure 3: Comparison of object-matching performance of CARTIER and NLMap across different models and query types. The reported score is the average success rate over all queries and environments.
  • Figure 4: Real-world deployment of CARTIER. Following a conversational query, the robot can navigate to the location (coffee machine) that best satisfies the user's intent.
  • Figure :