General-Purpose Robotic Navigation via LVLM-Orchestrated Perception, Reasoning, and Acting

Bernard Lange, Anil Yildiz, Mansur Arief, Shehryar Khattak, Mykel Kochenderfer, Georgios Georgakis

TL;DR

ARNA addresses generalization in robotic navigation by embedding an LVLM-based agent within an existing robotic stack to generate task-specific workflows through tool use and multimodal memory. It replaces fixed pipelines with an agentic loop that queries perception modules, reasons over multimodal inputs, and issues navigation actions, enabling robust exploration in unmapped environments. Evaluations in Habitat HM-EQA show state-of-the-art performance and efficient exploration, with qualitative results on RxR-CE and custom tasks illustrating broad generalization. The work highlights a scalable blueprint for integrating LVLM reasoning into robotics, though it notes significant compute costs and areas for improvement in memory verification and online re-planning.

Abstract

Developing general-purpose navigation policies for unknown environments remains a core challenge in robotics. Most existing systems rely on task-specific neural networks and fixed information flows, limiting their generalizability. Large Vision-Language Models (LVLMs) offer a promising alternative by embedding human-like knowledge for reasoning and planning, but prior LVLM-robot integrations have largely depended on pre-mapped spaces, hard-coded representations, and rigid control logic. We introduce the Agentic Robotic Navigation Architecture (ARNA), a general-purpose framework that equips an LVLM-based agent with a library of perception, reasoning, and navigation tools drawn from modern robotic stacks. At runtime, the agent autonomously defines and executes task-specific workflows that iteratively query modules, reason over multimodal inputs, and select navigation actions. This agentic formulation enables robust navigation and reasoning in previously unmapped environments, offering a new perspective on robotic stack design. Evaluated in Habitat Lab on the HM-EQA benchmark, ARNA outperforms state-of-the-art EQA-specific approaches. Qualitative results on RxR and custom tasks further demonstrate its ability to generalize across a broad range of navigation challenges.

Paper Structure

This paper contains 24 sections, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: While classical robotic stacks (top left) and recent LVLM-based solutions (bottom left) follow fixed pipelines, ARNA (right) introduces a robotic agent that orchestrates perception, reasoning, and navigation tools to accomplish diverse, language-defined tasks.
  • Figure 2: ARNA is a navigation framework that equips an LVLM-based agent with a library of perception, reasoning, and navigation tools from a modern robotic stack. The agent autonomously defines and executes task-specific workflows that iteratively query modules, reason over multimodal inputs, choose navigation actions, and update its memory to fulfill any provided task.
  • Figure 3: (Top) Visualization of plan generation with examples inspired by the Self-Discover approach (Zhou et al., 2024). (Bottom) Visualization of the memory update process with examples. Both the plan and other workflow components, together with memory, are then used to guide plan execution, as illustrated. Each step is accompanied by dedicated prompts that describe the intended usage and examples for the LVLM.
  • Figure 4: Examples on HM-EQA (top) and RxR (bottom) tasks, showing key decisions and findings with reference to the occupancy grids. The LVLM’s outputs provide insight into the agent’s reasoning behind these decisions. In HM-EQA, the agent systematically explores the environment to visually confirm the presence of a banjo in a bedroom (top), while in RxR it navigates to locate a toilet (bottom).
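
The loop described in the abstract and in the captions of Figures 2 and 3 (query a tool, update memory, choose the next action) can be illustrated with a toy sketch. This is not the authors' implementation. Every name below (`Observation`, `Memory`, `TOOLS`, `choose_step`, `run_agent`) is hypothetical, the environment is a dictionary of rooms rather than a simulator, and `choose_step` is a fixed rule standing in for the LVLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Observation:
    tool: str            # which tool produced this entry
    room: str            # where the agent was after the tool ran
    objects: List[str]   # objects reported (empty for navigation tools)


@dataclass
class Memory:
    """Hypothetical memory: an append-only log of tool outputs."""
    entries: List[Observation] = field(default_factory=list)

    def update(self, obs: Observation) -> None:
        self.entries.append(obs)


def detect_objects(state: dict) -> Observation:
    """Stand-in perception tool: list the objects in the current room."""
    room = state["room"]
    return Observation("detect_objects", room, list(state["rooms"][room]))


def go_to_next_room(state: dict) -> Observation:
    """Stand-in navigation tool: move to the next room in a fixed order."""
    names = list(state["rooms"])
    state["room"] = names[(names.index(state["room"]) + 1) % len(names)]
    return Observation("go_to_next_room", state["room"], [])


# Tool library the agent can draw from (assumed names, not from the paper).
TOOLS: Dict[str, Callable[[dict], Observation]] = {
    "detect_objects": detect_objects,
    "go_to_next_room": go_to_next_room,
}


def choose_step(target: str, memory: Memory) -> str:
    """Placeholder for the LVLM decision: look, then move, until found."""
    if not memory.entries:
        return "detect_objects"
    last = memory.entries[-1]
    if last.tool == "detect_objects":
        return "answer" if target in last.objects else "go_to_next_room"
    return "detect_objects"


def run_agent(target: str, state: dict, max_steps: int = 10) -> Tuple[str, Memory]:
    """Iterate: pick a step, call the tool, store the result in memory."""
    memory = Memory()
    for _ in range(max_steps):
        step = choose_step(target, memory)
        if step == "answer":
            return f"yes, found in {state['room']}", memory
        memory.update(TOOLS[step](state))
    return "not found within step budget", memory


world = {
    "room": "hall",
    "rooms": {"hall": ["lamp"], "kitchen": ["table"], "bedroom": ["bed", "banjo"]},
}
answer, log = run_agent("banjo", world)
print(answer)
```

In a real system the rule in `choose_step` would be replaced by a prompt to the model containing the task, the generated plan, and the current memory, and the tools would wrap actual perception and navigation modules.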