AdaVLN: Towards Visual Language Navigation in Continuous Indoor Environments with Moving Humans

Dillon Loh, Tomasz Bednarz, Xinxing Xia, Frank Guan

TL;DR

Visual Language Navigation research has largely focused on static scenes, but real indoor settings include moving humans, which AdaVLN addresses. The authors define Adaptive Visual Language Navigation (AdaVLN) and provide AdaSimulator and AdaR2R (Sample) to study navigation with dynamic human obstacles, including a freeze-time mechanism that ensures fair comparison across hardware. They evaluate a baseline GPT-based agent and report high collision rates and perception hallucinations, underscoring the need for better dynamic obstacle understanding and planning. Overall, the work provides new tools and benchmarks to bridge sim-to-real gaps in VLN and to study robust navigation under dynamic, human-filled environments.

Abstract

Visual Language Navigation is a task that challenges robots to navigate in realistic environments based on natural language instructions. While previous research has largely focused on static settings, real-world navigation must often contend with dynamic human obstacles. Hence, we propose an extension to the task, termed Adaptive Visual Language Navigation (AdaVLN), which seeks to narrow this gap. AdaVLN requires robots to navigate complex 3D indoor environments populated with dynamically moving human obstacles, adding a layer of complexity to navigation tasks that mimic the real world. To support exploration of this task, we also present the AdaVLN simulator and the AdaR2R datasets. The AdaVLN simulator enables easy inclusion of fully animated human models directly into common datasets like Matterport3D. We also introduce a "freeze-time" mechanism for both the navigation task and simulator, which pauses world state updates during agent inference, enabling fair comparisons and experimental reproducibility across different hardware. We evaluate several baseline models on this task, analyze the unique challenges introduced by AdaVLN, and demonstrate its potential to bridge the sim-to-real gap in VLN research.
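
The "freeze-time" idea can be illustrated with a toy sketch. This is not the AdaSimulator implementation; `ToyWorld`, `run_episode`, and all parameter values are hypothetical names and numbers chosen for illustration. The sketch assumes a single human walking at constant speed and an agent whose inference takes a fixed amount of wall-clock time per decision. With freeze-time enabled, the world only advances while an action is executed, so the human's final position does not depend on how slow the agent's hardware is.

```python
from dataclasses import dataclass


@dataclass
class ToyWorld:
    """Hypothetical 1-D world: one human walking at constant speed."""
    human_pos: float = 0.0
    human_speed: float = 1.0  # metres per simulated second (assumed value)
    sim_time: float = 0.0

    def step(self, dt: float) -> None:
        # Advance the world state by dt simulated seconds.
        self.human_pos += self.human_speed * dt
        self.sim_time += dt


def run_episode(inference_latency: float, freeze_time: bool,
                n_actions: int = 5, action_duration: float = 0.5) -> float:
    """Return the human's final position after n_actions agent decisions."""
    world = ToyWorld()
    for _ in range(n_actions):
        if not freeze_time:
            # Without freeze-time, the world keeps moving while the agent thinks.
            world.step(inference_latency)
        # With freeze-time, the world is paused during inference: no update here.
        # Executing the chosen action always advances the world.
        world.step(action_duration)
    return world.human_pos


# Same episode on "fast" and "slow" hardware, with and without freeze-time.
fast_frozen = run_episode(inference_latency=0.1, freeze_time=True)
slow_frozen = run_episode(inference_latency=2.0, freeze_time=True)
fast_live = run_episode(inference_latency=0.1, freeze_time=False)
slow_live = run_episode(inference_latency=2.0, freeze_time=False)

print(fast_frozen == slow_frozen)  # world state is identical across hardware
print(slow_live > fast_live)       # without freezing, slow agents face a different world
```

In this toy setting, the frozen runs end in the same world state regardless of inference latency, while the unfrozen runs diverge. That divergence is the hardware-dependent effect the freeze-time mechanism is described as removing.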

Paper Structure

This paper contains 17 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Jetbot navigating in a dynamic Matterport3D environment with moving human obstacles.
  • Figure 2: AdaSimulator's GUI Extension in Isaac Sim
  • Figure 3: Top: RGB observations, Bottom: Depth observations provided to agent. Note that the depth observations have been restricted to a range between 0 and 10 in this image for clarity.
  • Figure 4: Top: Environment the 9 navigation episodes were conducted in. Humans loop along the indicated paths infinitely throughout a navigation episode. Note that the paths have been deliberately chosen to interfere with the optimal path the robot would take.
  • Figure 5: Top: Sample of paths (represented by lines) taken by robots and humans during simulation. Coordinate origins are based on X-Y provided in MP3D GLB files which have been scaled to 1 unit : 1 meter. In cases where the robot's line moves back-and-forth around a point, the robot has gotten stuck in collision with a wall.