NAVCON: A Cognitively Inspired and Linguistically Grounded Corpus for Vision and Language Navigation
Karan Wanchoo, Xiaoye Zuo, Hannah Gonzalez, Soham Dan, Georgios Georgakis, Dan Roth, Kostas Daniilidis, Eleni Miltsakaki
TL;DR
NAVCON is a cognitively grounded, linguistically annotated Vision-Language Navigation (VLN) corpus built on R2R and RxR. It introduces four core navigation concepts and provides 236,316 silver concept annotations over approximately 30,000 instructions, together with 2.7 million aligned video frames drawn from approximately 19,000 instructions. The authors present a scalable annotation pipeline, validate the annotations with human evaluation studies, and demonstrate practical usefulness through a dedicated Navigation Concept Classifier and few-shot experiments with GPT-4o. By linking high-level navigation concepts to concrete visual observations and providing tools for concept detection in unseen instructions, the work aims to improve interpretability and cross-modal grounding in VLN. The resource may also support more data-efficient and transparent VLN research and applications.
Abstract
We present NAVCON, a large-scale annotated Vision-Language Navigation (VLN) corpus built on top of two popular datasets (R2R and RxR). The paper introduces four core navigation concepts that are cognitively motivated and linguistically grounded, along with an algorithm for generating large-scale silver annotations of naturally occurring linguistic realizations of these concepts in navigation instructions. We pair the annotated instructions with video clips of an agent acting on these instructions. NAVCON contains 236,316 concept annotations for approximately 30,000 instructions and 2.7 million aligned images (from approximately 19,000 instructions) showing what the agent sees when executing an instruction. To our knowledge, this is the first comprehensive resource of navigation concepts. We evaluated the quality of the silver annotations by conducting human evaluation studies on NAVCON samples. As further validation of the quality and usefulness of the resource, we trained a model for detecting navigation concepts and their linguistic realizations in unseen instructions. Additionally, we show that few-shot learning with GPT-4o performs well on this task when using the large-scale silver annotations of NAVCON.
