HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions

Yifei Dong, Fengyi Wu, Qi He, Zhi-Qi Cheng, Heng Li, Minghan Li, Zebang Cheng, Yuxuan Zhou, Jingdong Sun, Qi Dai, Alexander G Hauptmann

TL;DR

HA-VLN 2.0 presents a unified benchmark for human-aware Vision-and-Language Navigation that bridges discrete and continuous navigation in dynamic, multi-human environments. It introduces HAPS 2.0, two simulators (HA-VLN-DE/CE), a unified API, and 16,844 socially grounded instructions drawn from HA-R2R, with 910 annotated humans across 428 regions to capture realistic social dynamics. Two baseline agents (HA-VLN-VL and HA-VLN-CMA) and a sim-to-real validation on a real robot demonstrate that explicit social modeling improves robustness and reduces collisions, while an open leaderboard enables transparent comparison. By releasing comprehensive datasets, simulators, baselines, and evaluation protocols, HA-VLN 2.0 provides a robust foundation for safe, socially aware navigation research and real-world deployment.

Abstract

Vision-and-Language Navigation (VLN) has been studied mainly in either discrete or continuous settings, with little attention to dynamic, crowded environments. We present HA-VLN 2.0, a unified benchmark introducing explicit social-awareness constraints. Our contributions are: (i) a standardized task and metrics capturing both goal accuracy and personal-space adherence; (ii) the HAPS 2.0 dataset and simulators modeling multi-human interactions, outdoor contexts, and finer language-motion alignment; (iii) benchmarks on 16,844 socially grounded instructions, revealing sharp performance drops for leading agents under human dynamics and partial observability; and (iv) real-world robot experiments validating sim-to-real transfer, with an open leaderboard enabling transparent comparison. Results show that explicit social modeling improves navigation robustness and reduces collisions, underscoring the necessity of human-centric approaches. By releasing datasets, simulators, baselines, and protocols, HA-VLN 2.0 provides a strong foundation for safe, socially responsible navigation research.

Paper Structure

This paper contains 35 sections, 23 equations, 20 figures, 10 tables, 2 algorithms.

Figures (20)

  • Figure 1: HA-VLN 2.0 Navigation Scenario. HA-VLN 2.0 adds four key challenges: (i) unified discrete/continuous navigation with denser crowds, richer activities, and mixed indoor–outdoor scenes; (ii) stricter social-distance and collision constraints under partial observability; (iii) instructions explicitly grounded in human activities and spatial cues, improving language–vision alignment; and (iv) robust real-time planning amid occlusion and multi-human dynamics. Example: key positions (e.g., ➀, ➁) align with instructional cues referring to specific human behaviors. When the agent encounters a bystander on the phone (➁, Decision A), it intelligently turns right to avert a potential collision. On the right, RGB and Depth observations illustrate the agent’s panoramic view preceding decisions A, B, and C, capturing its dynamic responses to nearby humans.
  • Figure 2: HA-VLN Simulator. Unlike HA3D, which modeled sparse and static human activities in discrete settings, HA-VLN incorporates rich and dynamic human behaviors using HAPS 2.0 (172 activities, 486 models, 58k frames). Annotation involves two stages: (i) coarse-to-fine optimization via PSO and multi-view camera setups, and (ii) human-in-the-loop refinement for realistic crowd dynamics. Real-time rendering updates motions through a signaling mechanism, facilitating collision detection and dynamic agent–environment interactions. These improvements bridge discrete evaluation (DE) and realistic continuous navigation (CE), establishing a robust foundation for benchmarks in socially intelligent navigation.
  • Figure 3: Motion Analysis. (a) Top three motions from Stage 1 (without enrichment) and Stage 2 (with enrichment). (b) Overall activity statistics, comparing interaction types, movement distances, and the number of models. Enrichment expands both the variety and dynamic range of human activities.
  • Figure 4: HA-R2R Dataset Analysis. (a) Distribution of instruction length by human group size (none to $>3$). (b) Comparison of instruction lengths between HA-R2R and R2R-CE.
  • Figure 5: Agent Trajectory Examples (HA-VLN-CMA$^{*}$). The top row shows a failed episode in which the agent does not avoid an oncoming human and collides with them. The bottom row shows a successful episode: the agent proactively adjusts its trajectory to the left, avoids the human, and completes the task without a collision.
  • ...and 15 more figures