How To Not Train Your Dragon: Training-free Embodied Object Goal Navigation with Semantic Frontiers

Junting Chen, Guohao Li, Suryansh Kumar, Bernard Ghanem, Fisher Yu

TL;DR

This paper presents StructNav, a training-free object goal navigation framework that fuses classic semantic SLAM with frontier exploration and language-informed priors to guide search. By maintaining a structured scene representation (2D occupancy map, semantic point cloud, spatial scene graph) and a semantic frontier utility, StructNav achieves state-of-the-art results on Gibson without end-to-end training. Ablation studies identify semantic segmentation quality as a key bottleneck and show substantial gains from language-based priors over purely geometric exploration. The approach offers improved explainability and robustness in embodied navigation, with practical implications for ROS-based robotic deployment and sim-to-real analysis.

Abstract

Object goal navigation is an important problem in Embodied AI that involves guiding the agent to navigate to an instance of the object category in an unknown environment -- typically an indoor scene. Unfortunately, current state-of-the-art methods for this problem rely heavily on data-driven approaches, e.g., end-to-end reinforcement learning, imitation learning, and others. Moreover, such methods are typically costly to train and difficult to debug, leading to a lack of transferability and explainability. Inspired by recent successes in combining classical and learning methods, we present a modular and training-free solution, which embraces more classic approaches, to tackle the object goal navigation problem. Our method builds a structured scene representation based on the classic visual simultaneous localization and mapping (V-SLAM) framework. We then inject semantics into geometric-based frontier exploration to reason about promising areas to search for a goal object. Our structured scene representation comprises a 2D occupancy map, semantic point cloud, and spatial scene graph. Our method propagates semantics on the scene graphs based on language priors and scene statistics to introduce semantic knowledge to the geometric frontiers. With injected semantic priors, the agent can reason about the most promising frontier to explore. The proposed pipeline shows strong experimental performance for object goal navigation on the Gibson benchmark dataset, outperforming the previous state-of-the-art. We also perform comprehensive ablation studies to identify the current bottleneck in the object navigation task.
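
To make the idea of injecting semantic priors into geometric frontiers concrete, here is a minimal Python sketch. It is not the paper's utility function: the Gaussian-style agreement score, the inverse-distance geometric term, the `sem_weight` parameter, and all names (`ObservedObject`, `semantic_utility`, `geometric_utility`, `select_frontier`) are assumptions made for illustration. The prior dictionaries stand in for category-to-category prior distance and variance tables.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]
PairKey = Tuple[str, str]  # (observed category, goal category)


@dataclass(frozen=True)
class ObservedObject:
    """Hypothetical scene-graph node: a category label and a 2D map position."""
    category: str
    xy: Point


def semantic_utility(frontier: Point, objects: Sequence[ObservedObject], goal: str,
                     prior_dist: Dict[PairKey, float],
                     prior_var: Dict[PairKey, float]) -> float:
    """Assumed score in [0, 1]: how well the frontier-to-object distance
    matches the prior object-to-goal distance, maximised over objects."""
    best = 0.0
    for obj in objects:
        key = (obj.category, goal)
        if key not in prior_dist:
            continue  # no prior for this category pair
        d = math.dist(frontier, obj.xy)
        var = max(prior_var[key], 1e-6)  # guard against zero variance
        best = max(best, math.exp(-((d - prior_dist[key]) ** 2) / (2.0 * var)))
    return best


def geometric_utility(frontier: Point, agent: Point) -> float:
    """Assumed geometric term: closer frontiers score higher."""
    return 1.0 / (1.0 + math.dist(frontier, agent))


def select_frontier(frontiers: Sequence[Point], agent: Point,
                    objects: Sequence[ObservedObject], goal: str,
                    prior_dist: Dict[PairKey, float],
                    prior_var: Dict[PairKey, float],
                    sem_weight: float = 1.0) -> Point:
    """Return the frontier with the highest combined utility."""
    def score(f: Point) -> float:
        sem = semantic_utility(f, objects, goal, prior_dist, prior_var)
        return geometric_utility(f, agent) + sem_weight * sem
    return max(frontiers, key=score)


if __name__ == "__main__":
    # Toy prior values chosen for the example, not taken from the paper.
    prior_dist = {("sink", "toilet"): 1.0}
    prior_var = {("sink", "toilet"): 0.5}
    objects = [ObservedObject("sink", (5.0, 0.0))]
    frontiers = [(1.0, 0.0), (5.0, 1.0)]
    print(select_frontier(frontiers, (0.0, 0.0), objects, "toilet",
                          prior_dist, prior_var))
```

In this toy setup, setting `sem_weight=0.0` reduces the policy to purely geometric exploration and picks the nearest frontier, while the default weight favours the frontier next to the observed sink.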

Paper Structure (17 sections, 3 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Object Navigation with Structured Scene Representation. ObjectNav can be decomposed into inferring the potential position of the target object in the scene and point-to-point planning. Given a structured representation of the scene, composed of (i) a spatial scene graph, (ii) a semantic point cloud, and (iii) a 2D occupancy map, an agent can handle the two sub-tasks by querying semantic information from the scene graph and geometric information from the occupancy map. Because the scene graph and occupancy map are both computed from the semantic point cloud, the semantic point cloud is also considered part of our structured scene representation.
  • Figure 2: Overview of the StructNav pipeline. Our pipeline runs in a loop of receiving observations and generating actions to navigate an agent to the goal object in an unknown scene. Colored boxes show functional modules and arrows represent data flows: (1) RGBD observation ${\bm{I}}_t=({\bm{I}}_t^{\text{rgb}}, {\bm{I}}_t^{\text{depth}})$; (2) semantic image ${\bm{I}}_t^{\text{sem}}$; (3) estimated pose $\hat{\bm{s}}_t^l$; (4) RGB point cloud ${\bm{P}}_t^{\text{RGB}}$; (5) semantic point cloud ${\bm{P}}_t^{\text{sem}}$; (6) spatial scene graph ${\mathcal{G}}_t$; (7) 2D occupancy map ${\bm{M}}_t$; (8) frontiers ${\bm{F}}_t$; (9) intermediate navigation goal ${\bm{x}}_t^{\text{goal}}$.
  • Figure 3: StructNav (Python script). After receiving the RGBD images from the camera sensors, StructNav first updates the structured representation by processing the geometric and semantic information. Then, StructNav enters the navigation stage. The agent moves to the goal if the target is detected in the current frame. Otherwise, the agent navigates to the most promising frontier obtained from our structured representation.
  • Figure 4: Utility Module and Prior Matrices. a) Our utility module calculates the semantic utility from the spatial scene graph and the geometric utility from the frontiers. A policy module then selects the most promising frontier as the temporary navigation goal, based on the utilities and prior matrices. b) The prior matrices comprise a category-to-category prior distance matrix ${\bm{D}}_{\text{prior}}$ and a category-to-category prior distance variance matrix ${\bm{V}}_{\text{prior}}$. The green cross highlights the relationship between sink and toilet, which are close to each other with high confidence. The purple cross highlights the relationship between bed and refrigerator, which are always far from each other.
  • Figure 5: Visualization of navigation trajectories in RViz. A bird's-eye view of the maps and trajectories on a test scene recovered using our approach. For this example, the result is obtained by running GeoUtil and SemUtil on the same episode to navigate to a couch in the test scene Wiconisco. The green sphere indicates the start of the episode. The red sphere marks the end of the episode, where the agent returns the STOP action to the simulator. The blue dots on the map indicate the target object detected in this episode. The blue lines are the real trajectory recorded from the TF frame attached to the agent base, while the red lines are the path planned by the planner.
  • ...and 1 more figures
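
The control flow described in the Figure 3 caption can be sketched as a single decision step. This is a hedged illustration rather than the authors' script: the function name `structnav_step`, the four callables it takes, the action labels, and the "stop when no frontiers remain" branch are all assumptions introduced here. The stubs in the usage section stand in for the SLAM, segmentation, and utility modules.

```python
from typing import Any, Callable, Optional, Sequence, Tuple

Point = Tuple[float, float]
Decision = Tuple[str, Optional[Point]]  # (action label, target position)


def structnav_step(
    observation: Any,
    update_representation: Callable[[Any], Any],
    detect_goal: Callable[[Any], Optional[Point]],
    extract_frontiers: Callable[[Any], Sequence[Point]],
    pick_frontier: Callable[[Any, Sequence[Point]], Point],
) -> Decision:
    """One iteration of an observe-update-decide loop (illustrative only)."""
    # 1) Update the structured representation from the new RGBD observation.
    scene = update_representation(observation)

    # 2) If the target is visible, head straight for it.
    goal_xy = detect_goal(scene)
    if goal_xy is not None:
        return ("goto_goal", goal_xy)

    # 3) Otherwise explore: choose the most promising frontier.
    frontiers = extract_frontiers(scene)
    if not frontiers:
        # Assumption of this sketch: nothing left to explore means stop.
        return ("stop", None)
    return ("goto_frontier", pick_frontier(scene, frontiers))


if __name__ == "__main__":
    # Stub modules: the "scene" is just a dict built from the observation.
    def update(obs):
        return {"goal": obs.get("goal"), "frontiers": obs.get("frontiers", [])}

    def detect(scene):
        return scene["goal"]

    def frontiers_of(scene):
        return scene["frontiers"]

    def nearest_to_origin(scene, frontiers):
        return min(frontiers, key=lambda f: f[0] ** 2 + f[1] ** 2)

    obs = {"goal": None, "frontiers": [(3.0, 0.0), (1.0, 1.0)]}
    print(structnav_step(obs, update, detect, frontiers_of, nearest_to_origin))
```

Passing the modules in as callables keeps the step function independent of any particular SLAM or segmentation backend, which mirrors the modular design the captions describe.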