Ground-level Viewpoint Vision-and-Language Navigation in Continuous Environments

Zerui Li, Gengze Zhou, Haodong Hong, Yanyan Shao, Wenqi Lyu, Yanyuan Qiao, Qi Wu

TL;DR

The paper tackles the generalization gap in Vision-and-Language Navigation (VLN) that arises when instructions written from a human-height viewpoint are followed by a robot with a low-height viewpoint in continuous environments. It introduces Ground-level Viewpoint Navigation (GVNav), which fuses weighted historical observations for robust spatiotemporal context, scales waypoint prediction through large, ground-level datasets, and employs an adaptive, multi-view transformer to mitigate occlusions within a topological-map planning framework. A cross-dataset connectivity graph transfer from HM3D and Gibson provides stronger spatial priors, improving both simulation and real-world performance on a quadruped robot. Real-world validation on a Xiaomi Cyberdog with panoramic RGBD input demonstrates the approach's practicality and effectiveness in diverse environments, underscoring the importance of aligning human instructions with ground-level perception for robust VLN in robotics.

Abstract

Vision-and-Language Navigation (VLN) empowers agents to associate time-sequenced visual observations with corresponding instructions to make sequential decisions. However, generalization remains a persistent challenge, particularly when dealing with visually diverse scenes or transitioning from simulated environments to real-world deployment. In this paper, we address the mismatch between human-centric instructions and quadruped robots with a low-height field of view, proposing a Ground-level Viewpoint Navigation (GVNav) approach to mitigate this issue. This work represents the first attempt to highlight the generalization gap in VLN across varying heights of visual observation in realistic robot deployments. Our approach leverages weighted historical observations as enriched spatiotemporal contexts for instruction following, effectively managing feature collisions within cells by assigning appropriate weights to identical features across different viewpoints. This enables low-height robots to overcome challenges such as visual obstructions and perceptual mismatches. Additionally, we transfer the connectivity graph from the HM3D and Gibson datasets as an extra resource to strengthen spatial priors and provide a more comprehensive representation of real-world scenarios, leading to improved performance and generalizability of the waypoint predictor in real-world environments. Extensive experiments demonstrate that our Ground-level Viewpoint Navigation (GVNav) approach significantly improves performance in both simulated environments and real-world deployments with quadruped robots.

Paper Structure

This paper contains 15 sections, 2 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: There is a significant viewpoint height discrepancy between humans and the robot dog (top: human, bottom: dog). Humans typically have a much higher line of sight than the robot dog. Our waypoint prediction network provides robust predictions under a low line of sight.
  • Figure 2: Multi-view Information Gathering emphasizes more informative features for the current context, enabling adaptive selection of the visual representations from multiple viewpoints (A and B). The navigation policy identifies the optimal next viewpoint in the topological graph (selecting C as the next viewpoint after A). This prediction is based not only on the robot's current observation at A, but also on previous, unobstructed views (from B), allowing the robot to mitigate occlusions and plan more robust navigation strategies.
  • Figure 3: Real-world demo of our proposed Ground-View approach for vision-and-language navigation. Given the human instruction, GVNav takes only 12 RGBD images as input and outputs a predicted waypoint for robotic execution.
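
The weighting idea described in the abstract and in Figure 2 (features from several viewpoints landing in the same map cell, with more informative ones emphasized) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `softmax` and `fuse_cell_features`, the dot-product relevance score, and the `query` context vector are all assumptions made for illustration only.

```python
import math
from collections import defaultdict


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_cell_features(observations, query):
    """Fuse features that collide in the same map cell (hypothetical helper).

    observations: list of (cell, feature) pairs, where `cell` is a hashable
        cell index and `feature` is a list of floats from one viewpoint.
    query: context vector (assumed here to stand in for the current
        navigation context); features scoring higher against it get more weight.
    Returns a dict mapping each cell to its weighted-average feature.
    """
    by_cell = defaultdict(list)
    for cell, feat in observations:
        by_cell[cell].append(feat)

    fused = {}
    for cell, feats in by_cell.items():
        # Relevance of each viewpoint's feature: dot product with the query (an assumption).
        scores = [sum(f * q for f, q in zip(feat, query)) for feat in feats]
        weights = softmax(scores)
        dim = len(feats[0])
        # Weighted average across viewpoints, one dimension at a time.
        fused[cell] = [
            sum(w * feat[i] for w, feat in zip(weights, feats))
            for i in range(dim)
        ]
    return fused


if __name__ == "__main__":
    # Two viewpoints (say A and B) observe cell (0, 0); one observes cell (1, 2).
    obs = [((0, 0), [1.0, 0.0]), ((0, 0), [0.0, 1.0]), ((1, 2), [0.5, 0.5])]
    print(fuse_cell_features(obs, query=[2.0, 0.0]))
```

In this sketch, a cell seen from only one viewpoint keeps its feature unchanged, while a cell seen from several viewpoints leans toward whichever observation scores higher against the context. The paper's adaptive multi-view transformer presumably learns such weights rather than computing them from a fixed dot product.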