CARTIER: Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots
Dmitriy Rivkin, Nikhil Kakodkar, Francois Hogan, Bobak H. Baghi, Gregory Dudek
TL;DR
The paper addresses natural, conversational instruction execution for robot navigation by grounding LLM reasoning in a scene-aware representation. It introduces CARTIER, a five-stage pipeline that detects objects, builds a spatial language index, and uses LLMs to map ambiguous user queries to concrete navigation targets within AI2Thor, and it evaluates the pipeline against baselines using both perceptual and localization metrics. CARTIER substantially improves localization accuracy, particularly for long, conversational queries, and a real-world telepresence deployment demonstrates its feasibility. The work advances open-vocabulary, language-grounded robotics by integrating object-centric scene representations with LLM reasoning, while highlighting detector limitations and avenues for hybrid approaches. Overall, CARTIER enables more natural and effective instruction execution for robot navigation in everyday environments.
Abstract
This work explores the capacity of large language models (LLMs) to address problems at the intersection of spatial planning and natural language interfaces for navigation. We focus on following complex instructions that resemble natural conversation more than the explicit procedural directives typically seen in robotics. Unlike most prior work, where navigation directives are provided as simple imperative commands (e.g., "go to the fridge"), we examine implicit directives obtained through conversational interactions. We leverage the 3D simulator AI2Thor to create household query scenarios at scale, and augment it with complex language queries for 40 object types. We demonstrate that a robot using our method, CARTIER (Cartographic lAnguage Reasoning Targeted at Instruction Execution for Robots), can parse descriptive language queries up to 42% more reliably than existing LLM-enabled methods by exploiting the ability of LLMs to interpret the user interaction in the context of the objects in the scenario.
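
The idea of interpreting a conversational query in the context of the objects in the scene can be illustrated with a minimal sketch. This is not the authors' implementation and does not reproduce the five-stage pipeline; every name here (`DetectedObject`, `build_prompt`, `select_target`, `keyword_stub_llm`) is hypothetical, the prompt wording is invented for illustration, and the LLM is replaced by a keyword stub so the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class DetectedObject:
    """Hypothetical record for one detected object and its map location."""
    label: str
    position: Tuple[float, float]  # (x, y) in map coordinates (assumed)


def build_prompt(query: str, objects: List[DetectedObject]) -> str:
    """Put the scene's object labels and the user's utterance in one prompt."""
    labels = sorted({obj.label for obj in objects})
    return (
        "Objects in the scene: " + ", ".join(labels) + "\n"
        f"User says: {query}\n"
        "Reply with the single object label the robot should navigate to."
    )


def select_target(
    query: str,
    objects: List[DetectedObject],
    llm: Callable[[str], str],
) -> Optional[DetectedObject]:
    """Ask the (injected) LLM for a label and resolve it to a detected object.

    Returns None when the answer does not match any object in the scene.
    """
    answer = llm(build_prompt(query, objects)).strip().lower()
    for obj in objects:
        if obj.label.lower() == answer:
            return obj
    return None


def keyword_stub_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; it only recognises one hard-coded cue."""
    if "thirsty" in prompt.lower():
        return "Fridge"
    return "unknown"


# Example scene with two detected objects (illustrative values only).
scene = [
    DetectedObject("Sofa", (1.0, 2.0)),
    DetectedObject("Fridge", (4.5, 0.5)),
]
target = select_target("I'm really thirsty after that run", scene, keyword_stub_llm)
```

In this sketch the implicit request never names the fridge; the mapping from the utterance to a navigation target comes entirely from the language model's answer, which is then matched against the objects actually present. Passing the model in as a callable keeps the selection logic testable and makes it straightforward to swap the stub for a real LLM client.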
