UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation

Guangzhao Dai, Jian Zhao, Yuantao Chen, Yusen Qin, Hao Zhao, Guosen Xie, Yazhou Yao, Xiangbo Shu, Xuelong Li

TL;DR

UnitedVLN tackles Vision-Language Navigation in Continuous Environments by jointly rendering high-fidelity 360° appearance and semantic information from sparse neural points. It introduces a generalizable 3D Gaussian Splatting (3DGS) pre-training framework with two novel schemes: Search-Then-Query (STQ) for efficient neural point sampling and Separate-Then-United (STU) rendering to fuse NeRF-based semantic rendering with 3DGS appearance rendering. The approach yields state-of-the-art results on VLN-CE benchmarks (R2R-CE, RxR-CE), demonstrates strong generalization to other VLN-CE models, and achieves substantially faster rendering than prior NeRF-based methods. By uniting appearance-level cues with high-level semantic information, UnitedVLN enhances robustness against occlusions and ambiguities, enabling more accurate and interpretable navigation in complex indoor environments.

Abstract

Vision-and-Language Navigation (VLN), where an agent follows instructions to reach a target destination, has recently seen significant advancements. In contrast to navigation in discrete environments with predefined trajectories, VLN in Continuous Environments (VLN-CE) presents greater challenges, as the agent is free to navigate to any unobstructed location and is more vulnerable to visual occlusions or blind spots. Recent approaches have attempted to address this by imagining future environments, either through predicted future visual images or semantic features, rather than relying solely on current observations. However, these RGB-based and feature-based methods lack, respectively, the high-level semantic complexity or the intuitive appearance-level information crucial for effective navigation. To overcome these limitations, we introduce a novel, generalizable 3DGS-based pre-training paradigm, called UnitedVLN, which enables agents to better explore future environments by jointly rendering high-fidelity 360° visual images and semantic features. UnitedVLN employs two key schemes: search-then-query sampling and separate-then-united rendering, which facilitate efficient exploitation of neural primitives and help integrate both appearance and semantic information for more robust navigation. Extensive experiments demonstrate that UnitedVLN outperforms state-of-the-art methods on existing VLN-CE benchmarks.

Paper Structure

This paper contains 51 sections, 25 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Main insights of UnitedVLN for VLN-CE: Unlike existing state-of-the-art methods that explore only either predicted images or features in future environments, our UnitedVLN fully integrates navigation cues from both appearance and semantic information. By leveraging two complementary rendering strategies—(1) appearance-level rendering (e.g., distinct colors) and (2) semantic-level rendering (e.g., appearance-similar elements like doors)—UnitedVLN enhances the agent's ability to interpret instructions accurately and navigate complex spaces. The visualizations show how UnitedVLN's combined approach results in more accurate route choices, reducing errors caused by occlusions or ambiguities that challenge purely RGB- or feature-based methods.
  • Figure 2: Overall framework of UnitedVLN. UnitedVLN obtains complete, high-fidelity 360° visual observations, i.e., visual images and semantic features, through three stages: Initialization, Querying, and Rendering. In Initialization, it encodes the observed environment, i.e., visited and current observations, into a point cloud and a feature cloud. In Querying, it adopts a Search-Then-Query (STQ) sampling scheme for efficient neural point sampling: for each neural point in the feature/point cloud, it searches the point's neighborhood and queries its K nearest points. The sampled neural points are then fed into an MLP to regress neural radiance, volume density, and image/feature Gaussians, respectively. In Rendering, it applies a Separate-Then-United (STU) rendering scheme to the neural radiance and image/feature Gaussians from the previous stage: semantic features carrying high-level information are rendered via NeRF, and the visual image carrying appearance-level information (which interacts with the 3DGS-rendered feature) is rendered via 3DGS. Finally, the NeRF-rendered features and the 3DGS-rendered image are integrated to represent, in advance, the semantic information of future environments.
  • Figure 3: Visualization example of the navigation strategy on the val unseen split of the R2R-CE dataset. (a) denotes the navigation strategy of the baseline model. (b) denotes the RGB-united-Feature exploration strategy of our UnitedVLN.
  • Figure 4: Visualization example of RGB reconstruction for candidate locations using the UnitedVLN model. "GT" and "Pred" denote ground-truth images and rendered images by our pre-training method, respectively.
  • Figure 5: Ablation study on the number K of nearest points in NeRF and 3DGS.
  • ...and 3 more figures
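
The Querying stage described in the Figure 2 caption (search a point's neighborhood, then query its K nearest points) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `search_then_query`, the ball-shaped neighborhood defined by `radius`, and the brute-force NumPy distance computation are all assumptions made for illustration, and the paper's actual search structure may differ.

```python
import numpy as np

def search_then_query(points, query, radius, k):
    """Hypothetical sketch: indices of up to k points nearest to `query`
    among those lying within `radius` of it (brute force)."""
    # Search: keep only points inside a ball around the query location.
    dists = np.linalg.norm(points - query, axis=1)
    candidates = np.flatnonzero(dists <= radius)
    # Query: order the surviving candidates by distance, keep the k closest.
    order = np.argsort(dists[candidates], kind="stable")
    return candidates[order[:k]]

# Toy point cloud (assumed data, not from the paper).
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [5.0, 5.0, 5.0]])
neighbors = search_then_query(points, query=np.array([0.1, 0.0, 0.0]),
                              radius=3.0, k=2)
print(neighbors)
```

Restricting the candidate set before sorting is what keeps the K-nearest query cheap when the cloud is large; in practice a spatial index (grid or tree) would presumably replace the brute-force distance pass used here.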