VPN: Visual Prompt Navigation

Shuo Feng; Zihan Wang; Yuchen Li; Rui Kong; Hengyi Cai; Shuaiqiang Wang; Gim Hee Lee; Piji Li; Shuqiang Jiang

TL;DR

Visual Prompt Navigation (VPN) replaces language instructions with user-provided visual prompts on 2D top-view maps to guide embodied agents. The authors introduce VPNet, a ViT-based baseline that reasons over a topological graph with cross-modal attention to connect prompts to navigable decisions, and they construct two VPN benchmarks, R2R-VP and R2R-CE-VP, by converting VLN episodes and augmenting with PREVALENT and ScaleVLN data. They further propose view- and trajectory-level data augmentation and a DAgger-based training regimen to improve robustness and data efficiency. Experimental results show strong performance gains over VLN baselines, especially with trajectory-level augmentation, demonstrating the practicality and robustness of visual prompts for spatially grounded robot navigation.

Abstract

While natural language is commonly used to guide embodied agents, the inherent ambiguity and verbosity of language often hinder the effectiveness of language-guided navigation in complex environments. To address this, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user-provided visual prompts within 2D top-view maps. The visual prompt primarily marks the navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. This makes it friendlier to non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network for the VPN tasks, together with two data augmentation strategies that enhance navigation performance: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes). Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation. The code is available at https://github.com/farlit/VPN.
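The view-level augmentation mentioned above (altering initial headings and prompt orientations) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the pixel-coordinate convention, the counter-clockwise rotation, and the choice to treat the prompt rotation and the heading offset as two separate parameters are all assumptions made for illustration. The abstract does not say whether the two are sampled jointly or independently.

```python
import math

def rotate_prompt(points, center, angle):
    """Rotate 2D prompt waypoints counter-clockwise by `angle` radians about `center`.

    `points` is a list of (x, y) positions of the drawn trajectory on the
    top-view map (hypothetical representation, assumed for this sketch).
    """
    cx, cy = center
    c, s = math.cos(angle), math.sin(angle)
    return [
        (cx + c * (x - cx) - s * (y - cy), cy + s * (x - cx) + c * (y - cy))
        for x, y in points
    ]

def view_level_augment(points, heading, center, prompt_angle, heading_offset):
    """Return a rotated copy of the prompt and a perturbed initial heading.

    Both perturbations are passed in explicitly; how they are sampled
    is an assumption left to the caller.
    """
    new_points = rotate_prompt(points, center, prompt_angle)
    # Wrap the heading into [0, 2*pi).
    new_heading = (heading + heading_offset) % (2 * math.pi)
    return new_points, new_heading

# Usage: rotate the prompt by 90 degrees and turn the agent by 180 degrees.
aug_points, aug_heading = view_level_augment(
    [(1.0, 0.0), (2.0, 0.0)],
    heading=0.0,
    center=(0.0, 0.0),
    prompt_angle=math.pi / 2,
    heading_offset=math.pi,
)
```

Rotating the waypoints rather than the rendered map image keeps the sketch dependency-free; an image-based pipeline would apply the same rotation to the map pixels instead.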
