View Invariant Learning for Vision-Language Navigation in Continuous Environments

Josh Qixuan Sun, Huaiyuan Weng, Xiaoying Xing, Chul Min Yeum, Mark Crowley

TL;DR

This paper proposes VIL (View Invariant Learning), a view-invariant post-training strategy that enhances the robustness of existing navigation policies to changes in camera viewpoint and introduces a teacher-student framework for the Waypoint Predictor Module, a core component of most VLNCE baselines.

Abstract

Vision-Language Navigation in Continuous Environments (VLNCE), where an agent follows instructions and moves freely to reach a destination, is a key research problem in embodied AI. However, most existing approaches are sensitive to viewpoint changes, i.e., variations in camera height and viewing angle. Here we introduce a more general scenario, V$^2$-VLNCE (VLNCE with Varied Viewpoints), and propose a view-invariant post-training framework, called VIL (View Invariant Learning), that makes existing navigation policies more robust to changes in camera viewpoint. VIL employs a contrastive learning framework to learn sparse and view-invariant features. We also introduce a teacher-student framework for the Waypoint Predictor Module, a standard part of VLNCE baselines, where a view-dependent teacher model distills knowledge into a view-invariant student model. We employ an end-to-end training paradigm to jointly optimize these components. Empirical results show that our method outperforms state-of-the-art approaches on V$^2$-VLNCE by 8-15% in Success Rate on two standard benchmark datasets, R2R-CE and RxR-CE. Evaluation of VIL in standard VLNCE settings shows that, despite being trained for varied viewpoints, VIL often still improves performance. On the harder RxR-CE dataset, our method also achieves state-of-the-art performance across all metrics. This suggests that adding VIL does not diminish standard-viewpoint performance and can serve as a plug-and-play post-training method. We further evaluate VIL for simulated camera placements derived from real robot configurations (e.g., Stretch RE-1, LoCoBot), showing consistent improvements in performance. Finally, we present a proof-of-concept real-robot evaluation in two physical environments using a panoramic RGB sensor combined with LiDAR. The code is available at https://github.com/realjoshqsun/V2-VLNCE.
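The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of one common choice, an InfoNCE-style loss, under the assumption that features of the same scene from the standard and the varied viewpoint form a positive pair and all other pairings in the batch are negatives. The function names, the temperature value, and the toy feature vectors are illustrative assumptions, not taken from the paper or its repository, and the sparsity aspect mentioned in the abstract is not modeled.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(standard_feats, varied_feats, temperature=0.1):
    """Mean InfoNCE loss over a batch (assumed objective, not the paper's).

    Row i of both lists is assumed to describe the same scene, seen from
    the standard viewpoint and from a varied viewpoint respectively.
    """
    std = [l2_normalize(v) for v in standard_feats]
    var = [l2_normalize(v) for v in varied_feats]
    losses = []
    for i, anchor in enumerate(std):
        # Cosine similarity of the anchor to every varied-view feature.
        logits = [dot(anchor, cand) / temperature for cand in var]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Cross-entropy with the matching index i as the positive.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two scenes with 2-D features (hypothetical values).
standard = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(standard, [[0.9, 0.1], [0.1, 0.9]])
shuffled_loss = info_nce(standard, [[0.1, 0.9], [0.9, 0.1]])
```

In this toy batch the loss is lower when each varied-view feature stays close to its standard-view counterpart than when the pairing is swapped, which is the behavior a view-invariant encoder is trained toward.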

Paper Structure

This paper contains 16 sections, 4 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparison between standard VLNCE and our proposed V$^2$-VLNCE. Under viewpoint changes, baseline navigation policies suffer from degraded performance. Applying View Invariant Learning (VIL) significantly improves robustness, enabling the agent to navigate under varied viewpoints.
  • Figure 2: Overview of our view-invariant learning framework. (a) Training Phase: Given standard and varied viewpoints, the image encoder extracts features for both. A contrastive learning objective is applied to align representations across viewpoints and encourage view-invariant features. Meanwhile, a teacher-student framework is used for waypoint prediction, where a frozen teacher processes standard views and a student model adapts to varied views by training only a lightweight adapter module. (b) Inference Phase: Only the student model is used to predict waypoints under varied viewpoints. (c) ETPNav baseline: A standard VLNCE architecture without contrastive learning or teacher-student training.
  • Figure 3: Detailed architecture of the waypoint predictor student module used in teacher–student distillation.
  • Figure 4: The robot platform used in our experiments.
  • Figure 5: Real-world demo of our proposed VIL.
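The teacher-student setup described in the Figure 2 caption (a frozen teacher on standard views, a student that trains only a lightweight adapter on varied views) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the linear "waypoint head", the per-dimension affine adapter, the squared-error distillation loss, and the synthetic viewpoint distortion are hypothetical stand-ins, not the architecture of Figure 3 or the authors' code.

```python
def predict(weights, feats):
    """Toy frozen waypoint head: a single linear score."""
    return sum(w * f for w, f in zip(weights, feats))


def adapt(gain, bias, feats):
    """Hypothetical lightweight adapter: per-dimension scale and shift."""
    return [g * x + b for g, x, b in zip(gain, feats, bias)]


def distill_loss(weights, pairs, gain, bias):
    """Mean squared gap between student (varied view) and teacher (standard view)."""
    errs = [predict(weights, adapt(gain, bias, var)) - predict(weights, std)
            for std, var in pairs]
    return sum(e * e for e in errs) / len(errs)


def train_adapter(weights, pairs, lr=0.5, epochs=200):
    """Fit only the adapter by SGD; the shared head weights stay frozen."""
    dim = len(weights)
    gain, bias = [1.0] * dim, [0.0] * dim  # adapter starts as the identity
    for _ in range(epochs):
        for std, var in pairs:
            target = predict(weights, std)  # teacher output, never updated
            err = predict(weights, adapt(gain, bias, var)) - target
            for j in range(dim):
                # Gradients of err**2 with respect to the adapter parameters.
                gain[j] -= lr * 2 * err * weights[j] * var[j]
                bias[j] -= lr * 2 * err * weights[j]
    return gain, bias


# Synthetic data: the "varied view" is a fixed affine distortion of the
# standard-view features (an assumption made purely for this sketch).
head = [0.5, -0.3]
standard_views = [[1.0, 2.0], [0.5, -1.0], [-1.0, 0.5], [2.0, 1.0]]
pairs = [(s, [0.8 * x - 0.2 for x in s]) for s in standard_views]

initial_loss = distill_loss(head, pairs, [1.0, 1.0], [0.0, 0.0])
gain, bias = train_adapter(head, pairs)
final_loss = distill_loss(head, pairs, gain, bias)
```

After training, only `gain` and `bias` have changed, mirroring the idea that inference under varied viewpoints uses the student alone while the teacher serves only as a training signal.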