VLNVerse: A Benchmark for Vision-Language Navigation with Versatile, Embodied, Realistic Simulation and Evaluation

Sihao Lin; Zerui Li; Xunyi Zhao; Gengze Zhou; Liuyi Wang; Rong Wei; Rui Tang; Juncheng Li; Hanqing Wang; Jiangmiao Pang; Anton van den Hengel; Jiajun Liu; Qi Wu

TL;DR

VLNVerse addresses core VLN limitations by providing a physics-aware, large-scale, and extensible benchmark built on NVIDIA Isaac Sim. It unifies diverse VLN tasks into a single framework across three layers (Agent/Simulator, World, Benchmark) and introduces a scalable data pipeline with 263 interactive USD environments. The framework also presents GAMA, a unified multi-task navigation model using State-Adaptive MoE (SAME) to enable cross-task knowledge transfer. Extensive experiments show the benchmark's difficulty and the value of online fine-tuning and dialogue-based guidance for improving embodied navigation, highlighting its potential to bridge sim-to-real gaps and accelerate multi-task embodied AI research.

Abstract

Despite remarkable progress in Vision-Language Navigation (VLN), existing benchmarks remain confined to fixed, small-scale datasets with naive physical simulation. These shortcomings limit the insight that the benchmarks provide into sim-to-real generalization and create a significant research gap. Furthermore, task fragmentation prevents unified progress in the area, while limited data scales fail to meet the demands of modern LLM-based pretraining. To overcome these limitations, we introduce VLNVerse: a new large-scale, extensible benchmark designed for Versatile, Embodied, Realistic Simulation, and Evaluation. VLNVerse redefines VLN as a scalable, full-stack embodied AI problem. Its Versatile nature unifies previously fragmented tasks into a single framework and provides an extensible toolkit for researchers. Its Embodied design moves beyond intangible, teleporting "ghost" agents to agents that support full kinematics in a Realistic Simulation powered by a robust physics engine. We leverage the scale and diversity of VLNVerse to conduct a comprehensive Evaluation of existing methods, from classic models to MLLM-based agents. We also propose a novel unified multi-task model capable of addressing all tasks within the benchmark. VLNVerse aims to narrow the gap between simulated navigation and real-world generalization, providing the community with a vital tool to boost research towards scalable, general-purpose embodied locomotion agents.
