General-Purpose Robotic Navigation via LVLM-Orchestrated Perception, Reasoning, and Acting

Bernard Lange; Anil Yildiz; Mansur Arief; Shehryar Khattak; Mykel Kochenderfer; Georgios Georgakis

TL;DR

ARNA (Agentic Robotic Navigation Architecture) addresses generalization in robotic navigation by embedding an LVLM-based agent within an existing robotic stack, where it generates task-specific workflows through tool use and multimodal memory. It replaces fixed pipelines with an agentic loop that queries perception modules, reasons over multimodal inputs, and issues navigation actions, enabling robust exploration in unmapped environments. Evaluated in Habitat Lab on the HM-EQA benchmark, ARNA outperforms state-of-the-art EQA-specific approaches while exploring efficiently; qualitative results on RxR-CE and custom tasks illustrate generalization across a broad range of navigation tasks. The work offers a scalable blueprint for integrating LVLM reasoning into robotics, while noting significant compute costs and room for improvement in memory verification and online re-planning.

Abstract

Developing general-purpose navigation policies for unknown environments remains a core challenge in robotics. Most existing systems rely on task-specific neural networks and fixed information flows, limiting their generalizability. Large Vision-Language Models (LVLMs) offer a promising alternative by embedding human-like knowledge for reasoning and planning, but prior LVLM-robot integrations have largely depended on pre-mapped spaces, hard-coded representations, and rigid control logic. We introduce the Agentic Robotic Navigation Architecture (ARNA), a general-purpose framework that equips an LVLM-based agent with a library of perception, reasoning, and navigation tools drawn from modern robotic stacks. At runtime, the agent autonomously defines and executes task-specific workflows that iteratively query modules, reason over multimodal inputs, and select navigation actions. This agentic formulation enables robust navigation and reasoning in previously unmapped environments, offering a new perspective on robotic stack design. Evaluated in Habitat Lab on the HM-EQA benchmark, ARNA outperforms state-of-the-art EQA-specific approaches. Qualitative results on RxR and custom tasks further demonstrate its ability to generalize across a broad range of navigation challenges.
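
The agentic loop described above (query a module, reason over what came back, choose the next action) can be illustrated with a minimal sketch. This is not ARNA's implementation: the tool names, the `Memory` class, the 1-D corridor world, and `stub_policy` are all hypothetical, and a hand-written rule stands in for the LVLM call so that the example runs without any model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical tool signature: takes the shared state dict, returns a text observation.
Tool = Callable[[dict], str]


@dataclass
class Memory:
    """Stand-in for a multimodal memory; here it only stores text entries."""
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, tool_name: str, observation: str) -> None:
        self.entries.append((tool_name, observation))


def detect_objects(state: dict) -> str:
    # Placeholder perception tool: lists objects visible at the current position.
    return ",".join(state["world"].get(state["position"], []))


def move_forward(state: dict) -> str:
    # Placeholder navigation tool: advances one cell along a 1-D corridor.
    state["position"] += 1
    return f"moved to {state['position']}"


# Hypothetical tool library exposed to the agent.
TOOLS: Dict[str, Tool] = {
    "detect_objects": detect_objects,
    "move_forward": move_forward,
}


def stub_policy(target: str, memory: Memory) -> str:
    """Stands in for the LVLM: returns the next tool name, or 'stop'.

    Rule: look first; stop once the target has been seen; otherwise
    alternate between moving and looking.
    """
    if not memory.entries:
        return "detect_objects"
    last_tool, last_obs = memory.entries[-1]
    if last_tool == "detect_objects":
        return "stop" if target in last_obs.split(",") else "move_forward"
    return "detect_objects"


def run_agent(target: str, state: dict, max_steps: int = 20) -> Memory:
    """Runs the loop: pick a tool, execute it, record the observation."""
    memory = Memory()
    for _ in range(max_steps):
        choice = stub_policy(target, memory)
        if choice == "stop":
            break
        memory.add(choice, TOOLS[choice](state))
    return memory
```

A call such as `run_agent("sofa", {"position": 0, "world": {0: ["chair"], 2: ["sofa", "lamp"]}})` alternates looking and moving until the target appears in an observation. In a real system the policy would be an LVLM prompted with the task and memory contents, and the tools would wrap actual perception and navigation modules.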
