A Survey on Improving Human Robot Collaboration through Vision-and-Language Navigation

Nivedan Yakolli, Avinash Gautam, Abhijit Das, Yuankai Qi, Virendra Singh Shekhawat

TL;DR

The paper surveys Vision-and-Language Navigation (VLN) within robotics, emphasizing human-robot interaction and multi-agent collaboration. It traces VLN from foundational datasets and simulators to modern LLM-enabled reasoning, memory, and multi-robot coordination, outlining prevailing methods and metrics. Key contributions include a taxonomy of progress from single-agent to multi-agent VLN, a critical evaluation of benchmarks, and forward-looking directions such as proactive clarification, decentralized coordination, and robust sim-to-real transfer. The work highlights practical implications for deploying VLN-enabled robots in domains like healthcare, logistics, and disaster response.

Abstract

Vision-and-Language Navigation (VLN) is a multi-modal, cooperative task requiring agents to interpret human instructions, navigate 3D environments, and communicate effectively under ambiguity. This paper presents a comprehensive review of recent VLN advancements in robotics and outlines promising directions to improve multi-robot coordination. Despite progress, current models struggle with bidirectional communication, ambiguity resolution, and collaborative decision-making in multi-agent systems. We review approximately 200 relevant articles to provide an in-depth understanding of the current landscape. Through this survey, we aim to provide a thorough resource that inspires further research at the intersection of VLN and robotics. We advocate that future VLN systems should support proactive clarification, real-time feedback, and contextual reasoning through advanced natural language understanding (NLU) techniques. Additionally, decentralized decision-making frameworks with dynamic role assignment are essential for scalable, efficient multi-robot collaboration. These innovations can significantly enhance human-robot interaction (HRI) and enable real-world deployment in domains such as healthcare, logistics, and disaster response.

Paper Structure

This paper contains 16 sections, 13 figures, 1 table.

Figures (13)

  • Figure 1: Schematic representation of the VLN task as interactive navigation via natural language [gu2022vision]. The embodied agent perceives and acts within a 3D environment, while the oracle provides language-based guidance. Both the agent and oracle observe the environment and exchange information through natural language communication to achieve navigation objectives.
  • Figure 2: Overview of this survey.
  • Figure 4: Prominent VLN Datasets and Environments for Visual Navigation.
  • Figure 5: Hierarchical structure of the Matterport3D (MP3D) dataset and its derived VLN resources.
  • Figure 6: Hierarchical structure of the prominent VLN outdoor datasets.
  • ...and 8 more figures