NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM

Zihan Wang, Yaohui Zhu, Gim Hee Lee, Yachun Fan

TL;DR

NavRAG tackles data scarcity in Vision-and-Language Navigation (VLN) by using a hierarchical scene description tree and a retrieval-augmented LLM to generate diverse user-demand instructions. The authors annotate over 2 million navigation instructions across 861 scenes, enabling large-scale pretraining that improves zero-shot and fine-tuned VLN performance while aligning instruction style with real user demands. The framework combines global planning with local scene detail through a zone-based partitioning scheme and role-based instruction generation, enhancing generalization to unseen environments. The authors note two limitations: instruction correctness is difficult to evaluate, and targets are restricted to viewpoints rather than objects. Both point toward more robust, object-aware VLN data generation.

Abstract

Vision-and-Language Navigation (VLN) is an essential skill for embodied agents, allowing them to navigate in 3D environments by following natural language instructions. High-performance navigation models require a large amount of training data, and the high cost of manual annotation has seriously hindered the field. Some previous methods therefore translate trajectory videos into step-by-step instructions to expand the data, but such instructions do not match the way users actually communicate, which is to briefly describe a destination or state a specific need. Moreover, local navigation trajectories overlook global context and high-level task planning. To address these issues, we propose NavRAG, a retrieval-augmented generation (RAG) framework that generates user demand instructions for VLN. NavRAG uses an LLM to build a hierarchical scene description tree for 3D scene understanding, from global layout to local details. It then simulates various user roles with specific demands, retrieves from the scene tree, and generates diverse instructions with the LLM. We annotate over 2 million navigation instructions across 861 scenes and evaluate the data quality and navigation performance of trained models.

Paper Structure

This paper contains 15 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The comparison of previous navigation instruction generation methods (a) and NavRAG (b).
  • Figure 2: Demonstration of the Scene Description Tree. Using an LLM, NavRAG constructs the scene description tree bottom-up, progressing from objects to views, viewpoints, zones, and the overall scene. This hierarchical structure describes environmental semantics and spatial relationships at different levels, helping the LLM understand 3D environments and retrieve information for instruction generation.
  • Figure 3: Framework of the zone partitioning algorithm based on connectivity relations and environmental semantics.
  • Figure 4: Framework of NavRAG for scene tree construction and navigation instruction generation through Retrieval-Augmented LLM.
  • Figure 5: Prompt, input and output of the Rough Instruction Generator and Hierarchical Retrieval.
  • ...and 2 more figures
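
The bottom-up scene description tree from Figure 2 and the demand-driven retrieval described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Node` class, the `summarize` and `retrieve` functions, and the toy scene are all hypothetical, and plain string joining and word overlap stand in for the LLM calls that NavRAG uses to write descriptions and choose branches.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a scene description tree (hypothetical structure)."""
    level: str  # "scene", "zone", "viewpoint", "view", or "object"
    name: str
    description: str = ""
    children: list[Node] = field(default_factory=list)


def summarize(node: Node) -> str:
    """Bottom-up pass: derive each parent's description from its children.

    Assumption: a real system would ask an LLM to write the summary;
    joining child descriptions keeps this sketch runnable offline.
    """
    if node.children:
        parts = [summarize(child) for child in node.children]
        node.description = f"{node.name}: " + "; ".join(parts)
    elif not node.description:
        node.description = node.name
    return node.description


def tokens(text: str) -> set[str]:
    """Lowercase word set used by the stand-in similarity score."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(root: Node, demand: str, stop_level: str = "viewpoint") -> Node:
    """Top-down pass: at each level, follow the child whose description
    shares the most words with the demand (an LLM would choose here)."""
    wanted = tokens(demand)
    node = root
    while node.level != stop_level and node.children:
        node = max(node.children,
                   key=lambda c: len(wanted & tokens(c.description)))
    return node


# Toy scene: two zones, one viewpoint each (all names are made up).
scene = Node("scene", "house", children=[
    Node("zone", "kitchen zone", children=[
        Node("viewpoint", "vp_kitchen", children=[
            Node("view", "view_0", children=[
                Node("object", "coffee machine"),
                Node("object", "sink"),
            ]),
        ]),
    ]),
    Node("zone", "bedroom zone", children=[
        Node("viewpoint", "vp_bedroom", children=[
            Node("view", "view_0", children=[
                Node("object", "bed"),
                Node("object", "lamp"),
            ]),
        ]),
    ]),
])

summarize(scene)                               # build descriptions bottom-up
target = retrieve(scene, "I want a coffee")    # retrieve top-down
print(target.name)                             # vp_kitchen
```

The retrieval stops at the viewpoint level because, as the TL;DR notes, navigation targets in NavRAG are viewpoints rather than individual objects.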