Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions

Heng Li; Minghan Li; Zhi-Qi Cheng; Yifei Dong; Yuxuan Zhou; Jun-Yan He; Qi Dai; Teruko Mitamura; Alexander G. Hauptmann

TL;DR

This work presents the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments.

Abstract

Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions. We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, extending R2R with human activity descriptions. To tackle HA-VLN challenges, we present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments. A comprehensive evaluation, including metrics that account for human activities, and a systematic analysis of HA-VLN's unique challenges underscore the need for further research to enhance HA-VLN agents' real-world robustness and adaptability. Ultimately, this work provides benchmarks and insights for future research on embodied AI and Sim2Real transfer, paving the way for more realistic and applicable VLN systems in human-populated environments.

Paper Structure (29 sections, 12 equations, 18 figures, 11 tables, 2 algorithms)

Introduction
Human-Aware Vision-and-Language Navigation
HA3D Simulator: Integrating Dynamic Human Activities
Human-Aware Navigation Agents
Experiments
Evaluation Protocol for HA-VLN Task
Evaluating HA-VLN Assumptions
Evaluation of SOTA VLN Agents on the HA-VLN Task
Evaluation of Agents on HA-VLN Task
Evaluation on Real-World Robots
Discussion
Conclusion
Related Work
Simulator Details
HAPS Dataset
...and 14 more sections

Figures (18)

Figure 1: HA-VLN Scenario: The agent navigates through environments populated with dynamic human activities. The task involves optimizing routes while maintaining safe distances from humans to address the Sim2Real gap. In this scenario, the agent encounters various human activities, such as someone talking on the phone while pacing in the hallway, someone taking off their shoes in the entryway/foyer, and someone carrying groceries upstairs. The HA-VLN agent must adapt its path by waiting for humans to move, adjusting its path, or proceeding through when clear, thereby enhancing real-world applicability.
Figure 2: Human-Aware 3D (HA3D) Simulator Annotation Process: HA3D integrates dynamic human activities from the Human Activity and Pose Simulation (HAPS) dataset into the photorealistic environments of Matterport3D. The annotation process involves: (1) integrating the HAPS dataset, which includes 145 human activity descriptions converted into 435 detailed 3D human motion models in 52,200 frames; (2) annotating human activities within various indoor regions across 90 building scenes using an interactive annotation tool; (3) rendering realistic human models; and (4) enabling interactive agent-environment interactions.
Figure 3: Dataset Analysis of HA-R2R: (A) Impact of human activities on instruction length, tokenized using NLTK WordNet, showing the variation in instruction length caused by different types of human activities. (B) Comparison of instruction length distributions between HA-R2R and the original R2R dataset. HA-R2R demonstrates a more uniform distribution, facilitating balanced training. (C) Analysis of viewpoints affected by human activities: "Visible" denotes activities within the agent's sight, "Isolated" refers to key navigation nodes impacted by human activities, and "Occupied" indicates the presence of humans at specific viewpoints.
Figure 4: Model Architectures of Navigation Agents: The architectures of the Vision-Language Navigation Cross-Modal (VLN-CM) agent (left) and the Vision-Language Navigation Decision Transformer (VLN-DT) agent (right). Both agents employ a cross-modality fusion module to effectively integrate visual and linguistic information for predicting navigation actions. VLN-CM utilizes an LSTM-based sequence-to-sequence model for expert-supervised learning, while VLN-DT leverages an autoregressive transformer model to learn from random trajectories without expert supervision.
Figure 5: Effects of Reward Strategies on VLN-DT.
...and 13 more figures
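
The HAPS counts quoted in the Figure 2 caption (145 activity descriptions, 435 motion models, 52,200 frames) can be sanity-checked with a little arithmetic. The sketch below assumes, purely for illustration, that models are spread evenly across descriptions and frames evenly across models; the caption gives only the totals, so the per-unit figures are an inference and not a statement from the paper. The helper name `per_unit` is hypothetical.

```python
# Totals quoted in the Figure 2 caption.
NUM_DESCRIPTIONS = 145
NUM_MOTION_MODELS = 435
NUM_FRAMES = 52_200


def per_unit(total: int, units: int) -> int:
    """Return total // units, raising if the division is not exact.

    Assumption (not stated in the caption): items are spread evenly.
    """
    quotient, remainder = divmod(total, units)
    if remainder != 0:
        raise ValueError(f"{total} is not evenly divisible by {units}")
    return quotient


# Motion models per activity description, under the even-split assumption.
models_per_description = per_unit(NUM_MOTION_MODELS, NUM_DESCRIPTIONS)

# Frames per motion model, under the same assumption.
frames_per_model = per_unit(NUM_FRAMES, NUM_MOTION_MODELS)

print(models_per_description, frames_per_model)
```

Both divisions are exact, so the totals are at least consistent with a uniform layout of a fixed number of models per description and a fixed number of frames per model.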
