GaussNav: Gaussian Splatting for Visual Navigation

Xiaohan Lei, Min Wang, Wengang Zhou, Houqiang Li

TL;DR

GaussNav tackles Instance ImageGoal Navigation (IIN) by replacing traditional BEV maps with a Semantic Gaussian, a map built on 3D Gaussian Splatting that preserves geometry, semantics, and texture. The method grounds the target object by rendering descriptive views of candidate instances and matching them to the goal image, which reframes IIN as a tractable point-goal task. A three-stage pipeline (Frontier Exploration, Semantic Gaussian Construction, and Gaussian Navigation) yields state-of-the-art performance on HM3D, with SPL up to $0.578$ at over 20 FPS, while ablations highlight the importance of classification, matching, and novel view synthesis. This work advances instance-level visual navigation by leveraging differentiable rendering and a composable 3D Gaussian representation to retain the texture details critical for distinguishing objects across viewpoints.

Abstract

In embodied vision, Instance ImageGoal Navigation (IIN) requires an agent to locate a specific object depicted in a goal image within an unexplored environment. The primary challenge of IIN is recognizing the target object across varying viewpoints while ignoring potential distractors. Existing map-based navigation methods typically use Bird's Eye View (BEV) maps, which lack detailed texture representation of a scene. Consequently, while BEV maps are effective for semantic-level visual navigation, they struggle with instance-level tasks. To this end, we propose a new framework for IIN, Gaussian Splatting for Visual Navigation (GaussNav), which constructs a novel map representation based on 3D Gaussian Splatting (3DGS). The GaussNav framework enables the agent to memorize both the geometry and semantic information of the scene, as well as retain the textural features of objects. By matching renderings of similar objects with the target, the agent can accurately identify, ground, and navigate to the specified object. Our GaussNav framework demonstrates a significant performance improvement, with Success weighted by Path Length (SPL) increasing from 0.347 to 0.578 on the challenging Habitat-Matterport 3D (HM3D) dataset. The source code is publicly available at https://github.com/XiaohanLei/GaussNav.
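
The abstract reports results in terms of SPL without spelling the metric out. As a minimal sketch, assuming the paper uses the definition that is common in embodied navigation (each successful episode contributes the ratio of the shortest-path length to the length of the path actually taken, and failed episodes contribute zero), the metric could be computed as follows. The `Episode` type and its field names are hypothetical and are not taken from the GaussNav code base.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """Hypothetical per-episode record; field names are illustrative only."""
    success: bool         # did the agent stop at the goal instance?
    shortest_path: float  # shortest-path length from start to goal, assumed > 0
    agent_path: float     # length of the path the agent actually travelled


def spl(episodes: Sequence[Episode]) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for ep in episodes:
        if ep.success:
            # The max() caps the per-episode score at 1 even if the logged
            # agent path is shorter than the reference shortest path.
            total += ep.shortest_path / max(ep.agent_path, ep.shortest_path)
    return total / len(episodes)


episodes = [
    Episode(success=True, shortest_path=5.0, agent_path=5.0),   # optimal: 1.0
    Episode(success=True, shortest_path=5.0, agent_path=10.0),  # detour: 0.5
    Episode(success=False, shortest_path=5.0, agent_path=3.0),  # failure: 0.0
]
print(spl(episodes))
```

Under this definition, an agent that always succeeds but takes paths twice as long as necessary scores 0.5, so SPL penalizes inefficient routes as well as failures.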

Paper Structure

This paper contains 15 sections, 15 equations, 12 figures, and 6 tables.

Figures (12)

  • Figure 1: Illustration of Instance ImageGoal Navigation (IIN), which requires an agent to navigate to the object instance depicted in the goal image while distinguishing it from other visually similar instances.
  • Figure 2: Framework Overview. In the first episode of a scene, the agent uses frontier exploration to gather observations of the unknown environment, constructing a Semantic Gaussian. In subsequent episodes, the pre-constructed Semantic Gaussian is utilized by Gaussian Navigation to ground the goal object and guide the agent towards it.
  • Figure 3: Exploration Map and Obstacle Map.
  • Figure 4: An illustration of Semantic Gaussian Construction. At timestep $t$, the pipeline densifies and updates the Gaussians from timestep $t-1$ by comparing the rendered RGB and depth images against the current input training views. Concurrently, semantic labels are assigned to the densified Gaussians using the segmented images. Finally, the Gaussians are refined through differentiable rendering.
  • Figure 5: An illustration of Gaussian Navigation. Our approach begins by classifying the goal image using the pre-constructed Semantic Gaussian. Once the class is predicted, we render descriptive images around the instances belonging to that class. These images are then matched against the goal image to identify and ground the goal instance. Given the map and the grounded goal, the agent uses path planning to compute the sequence of actions.
  • ...and 7 more figures
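
The frontier exploration stage in Figure 2 and the two maps in Figure 3 can be illustrated with a small grid example. The sketch below is not the authors' implementation; it assumes the textbook notion of a frontier (an explored, obstacle-free cell that is 4-adjacent to unexplored space) and represents the exploration map and obstacle map as boolean NumPy arrays. The function name `find_frontiers` and the toy map are hypothetical.

```python
import numpy as np


def find_frontiers(explored: np.ndarray, obstacle: np.ndarray) -> np.ndarray:
    """Return a boolean mask of frontier cells.

    A frontier cell is explored, obstacle-free, and 4-adjacent to at least
    one unexplored cell.
    """
    explored = explored.astype(bool)
    free = explored & ~obstacle.astype(bool)
    unexplored = ~explored
    # Pad with False so the map border is not treated as unexplored space.
    padded = np.pad(unexplored, 1, constant_values=False)
    near_unexplored = (
        padded[:-2, 1:-1]     # cell above is unexplored
        | padded[2:, 1:-1]    # cell below is unexplored
        | padded[1:-1, :-2]   # cell to the left is unexplored
        | padded[1:-1, 2:]    # cell to the right is unexplored
    )
    return free & near_unexplored


# Toy 5x5 map: a 3x3 explored block with one obstacle in its corner.
explored = np.zeros((5, 5), dtype=bool)
explored[1:4, 1:4] = True
obstacle = np.zeros((5, 5), dtype=bool)
obstacle[1, 1] = True

frontiers = find_frontiers(explored, obstacle)
print(frontiers.astype(int))
```

In the toy map, the centre of the explored block is not a frontier because all of its neighbours are already explored, and the obstacle cell is excluded because the agent cannot stand on it. An exploration policy would typically pick one of the remaining frontier cells, for example the nearest one, as the next waypoint.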