VPN: Visual Prompt Navigation

Shuo Feng, Zihan Wang, Yuchen Li, Rui Kong, Hengyi Cai, Shuaiqiang Wang, Gim Hee Lee, Piji Li, Shuqiang Jiang

TL;DR

Visual Prompt Navigation (VPN) replaces language instructions with user-provided visual prompts on 2D top-view maps to guide embodied agents. The authors introduce VPNet, a ViT-based baseline that reasons over a topological graph with cross-modal attention to connect prompts to navigable decisions, and they construct two VPN benchmarks, R2R-VP and R2R-CE-VP, by converting VLN episodes and augmenting with PREVALENT and ScaleVLN data. They further propose view- and trajectory-level data augmentation and a DAgger-based training regimen to improve robustness and data efficiency. Experimental results show strong performance gains over VLN baselines, especially with trajectory-level augmentation, demonstrating the practicality and robustness of visual prompts for spatially grounded robot navigation.

Abstract

While natural language is commonly used to guide embodied agents, the inherent ambiguity and verbosity of language often hinder the effectiveness of language-guided navigation in complex environments. To this end, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user-provided visual prompts within 2D top-view maps. This visual prompt primarily focuses on marking the visual navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. It is more friendly for non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network to handle the VPN tasks, with two data augmentation strategies: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes), to enhance navigation performance. Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation. The code is available at https://github.com/farlit/VPN.
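The view-level augmentation mentioned in the abstract (altering initial headings and prompt orientations) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the grid representation of the top-view map, and the choice to pair every map rotation with every starting heading are assumptions made for illustration only. The four quarter-turn angles follow the values shown in the Figure 4 caption.

```python
import math

# Quarter-turn rotations: 0, pi/2, pi, 3*pi/2 (angles taken from the Figure 4 caption).
QUARTER_TURNS = (0, 1, 2, 3)


def rotate_grid_ccw(grid, k):
    """Rotate a 2D grid (list of rows) counter-clockwise by k quarter turns.

    The grid is a hypothetical stand-in for a top-view map with a drawn prompt.
    """
    rotated = [list(row) for row in grid]  # copy so the input is never mutated
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one counter-clockwise quarter turn.
        rotated = [list(row) for row in zip(*rotated)][::-1]
    return rotated


def rotate_heading(heading, k):
    """Offset an initial heading (radians) by k quarter turns, wrapped to [0, 2*pi)."""
    return (heading + k * math.pi / 2) % (2 * math.pi)


def view_level_augment(prompt_map, init_heading):
    """Return (rotated_map, rotated_heading) pairs.

    Assumption: map rotation and heading change are varied independently,
    giving 4 x 4 = 16 variants per episode. The paper may combine them differently.
    """
    samples = []
    for k_map in QUARTER_TURNS:
        for k_head in QUARTER_TURNS:
            samples.append(
                (rotate_grid_ccw(prompt_map, k_map), rotate_heading(init_heading, k_head))
            )
    return samples


if __name__ == "__main__":
    toy_map = [[1, 0, 0],
               [0, 0, 0]]  # a single marked cell stands in for the visual prompt
    variants = view_level_augment(toy_map, init_heading=0.0)
    print(len(variants))
```

Under these assumptions, a single episode yields sixteen training variants. In a real pipeline, the rotation would be applied to the rendered top-view image rather than to a toy grid.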

Paper Structure

This paper contains 22 sections, 11 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Illustration of three types of visual navigation tasks. Compared to both language-based and image-goal instructions, visual prompt instructions provide clearer and more interpretable guidance.
  • Figure 2: Illustration of the process of constructing visual prompts on a 2D top-view map. The four subfigures are labeled (a)–(d) from left to right, top to bottom.
  • Figure 3: Illustration of VPNet. "vpm" denotes the token corresponding to visual prompts, "stop" denotes the token for the stop action, and "navn" denotes the token corresponding to a navigable candidate.
  • Figure 4: Illustration of prompt-view augmentation (left) and agent-view augmentation (right). The left side shows rotated 2D top-view maps with visual prompts at $0$, $\frac{\pi}{2}$, $\pi$, and $\frac{3\pi}{2}$. The right side shows the corresponding first-person observations when the agent starts with $0$, $\frac{\pi}{2}$, $\pi$, and $\frac{3\pi}{2}$.
  • Figure 5: Illustration of different 2D top-view maps with visual prompts. The six subfigures are labeled (a)–(e) from left to right, top to bottom.
  • ...and 2 more figures