FrontierNet: Learning Visual Cues to Explore

Boyang Sun; Hanzhi Chen; Stefan Leutenegger; Cesar Cadena; Marc Pollefeys; Hermann Blum

TL;DR

FrontierNet introduces a visual-only frontier-based exploration framework that learns to propose frontier regions and predict their information gain directly from posed RGB images augmented with monocular depth priors. A lightweight anchoring process (viewpoint generation, clustering, and 3D lifting) grounds the 2D frontier cues in 3D, producing sparse, query-efficient 3D frontiers that drive an occupancy-map-guided planner. The approach improves early-stage exploration efficiency by a reported 15% and remains robust under monocular depth predictions, with sim-to-real transfer demonstrated on a Spot robot. By combining appearance-based frontier detection and learned information gain with a proposal stage that avoids operations on dense 3D maps, it offers a practical alternative to exploration methods that extract goals from dense 3D maps. FrontierNet is validated through extensive HM3D simulations and real-world experiments, including multi-floor and cluttered environments.
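
The lifting step mentioned above can be illustrated with standard pinhole back-projection. This is a minimal sketch, not the authors' implementation: the function names `backproject` and `to_world`, the intrinsics `fx, fy, cx, cy`, and the camera-to-world pose `(R, t)` are all assumptions introduced here for illustration.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift pixel (u, v) with metric depth into the camera frame (pinhole model).

    Hypothetical helper: the intrinsics are assumed known from calibration.
    """
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def to_world(p_cam: Vec3, R: Sequence[Sequence[float]], t: Sequence[float]) -> Vec3:
    """Apply an assumed camera-to-world rigid transform: p_world = R @ p_cam + t."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )


# Example: a frontier pixel 100 px right of the principal point, 2 m away.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_cam = backproject(420, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p_world = to_world(p_cam, identity, (1.0, 2.0, 3.0))
```

In this sketch the depth value would come from the monocular depth prior, which is why depth quality directly affects where a lifted frontier lands in 3D.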

Abstract

Exploration of unknown environments is crucial for autonomous robots; it allows them to actively reason and decide on what new data to acquire for different tasks, such as mapping, object discovery, and environmental assessment. Existing solutions, such as frontier-based exploration approaches, rely heavily on 3D map operations, which are limited by map quality and, more critically, often overlook valuable context from visual cues. This work aims to leverage 2D visual cues for efficient autonomous exploration, addressing the limitations of extracting goal poses from a 3D map. We propose a visual-only frontier-based exploration system, with FrontierNet as its core component. FrontierNet is a learning-based model that (i) proposes frontiers, and (ii) predicts their information gain, from posed RGB images enhanced by monocular depth priors. Our approach provides an alternative to existing 3D-dependent goal-extraction approaches, achieving a 15% improvement in early-stage exploration efficiency, as validated through extensive simulations and real-world experiments. The project is available at https://github.com/cvg/FrontierNet.
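
To make concrete how a predicted information gain could prioritize frontiers, the sketch below ranks candidates by a simple gain-over-distance utility. The utility function, the `Frontier` class, and the `distance_weight` parameter are hypothetical choices made for illustration; the summary above does not specify the scoring rule the system actually uses.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Frontier:
    """Hypothetical container for one 3D frontier and its predicted gain."""
    position: Tuple[float, float, float]
    gain: float


def select_goal(frontiers: List[Frontier],
                robot_position: Tuple[float, float, float],
                distance_weight: float = 1.0) -> Optional[Frontier]:
    """Return the frontier with the highest assumed utility, or None if empty.

    Assumed utility: gain / (1 + distance_weight * Euclidean distance).
    """
    if not frontiers:
        return None

    def utility(f: Frontier) -> float:
        d = math.dist(f.position, robot_position)
        return f.gain / (1.0 + distance_weight * d)

    return max(frontiers, key=utility)


# Example: a distant high-gain frontier can outrank a nearby low-gain one.
candidates = [Frontier((1.0, 0.0, 0.0), 2.0), Frontier((4.0, 0.0, 0.0), 10.0)]
best = select_goal(candidates, (0.0, 0.0, 0.0))
```

A real planner would typically measure distance along a collision-free path rather than a straight line; the Euclidean distance here only keeps the example self-contained.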

Paper Structure (19 sections, 5 equations, 11 figures, 3 tables)

Introduction
Related Work
Method
Problem Statement
System Overview
Learning to Propose Frontiers from Visual Appearance
Data Generation and Model Training
Anchoring Frontier in 3D
Viewpoint Generation
Clustering
3D Lifting
Exploration Planning
Frontier Update
Path Planning
Experiment And Result
...and 4 more sections

Figures (11)

Figure 1: Top: FrontierNet processes an RGB image (left) to propose frontier pixels and their information gain (middle), registering candidate goal viewpoints with varying priorities in 3D (right). Bottom: Using FrontierNet, our exploration system prioritizes visiting unknown regions with greater potential unmapped volume, achieving higher efficiency.
Figure 2: FrontierNet learns to propose regions for exploration from visual cues in RGB images. Unlike existing methods, it avoids operations on dense 3D maps at the proposal stage, which are sensitive to map quality, and often discard rich appearance information.
Figure 3: System Overview. Our system processes posed RGB images with a depth prediction model [hu2024metric3d] to generate estimated depth. FrontierNet uses visual input to predict 2D frontier regions and their info gain, which are transformed into sparse 3D frontiers with different gains (colored frustums). These frontiers are tracked, and the planning module selects the next best goal and plans a path using the occupancy map.
Figure 4: Ground Truth Generation. For a sampled camera pose in the voxelized scene, 3D frontier voxels are calculated and projected onto the camera frame using the ground-truth 3D occupancy grid. Merging the projection with the depth discontinuity mask produces a refined, less noisy frontier pixel mask $\mathbf{F}$, which is used to calculate the distance field map $\mathbf{D}$. Additionally, projecting the info gain of each frontier voxel onto the camera frame generates the info gain map $\mathbf{G}$.
Figure 5: 3D Frontier Generation. Each frontier pixel is assigned a 2D viewing angle derived from the depth gradient. Combined with the info gain, 2D clustering is applied to obtain sparse 2D frontier clusters with associated viewing directions (middle). The foreground and background depths near the frontier pixels are then utilized to lift each clustered 2D frontier into 3D space (right).
...and 6 more figures
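
The Figure 4 caption describes deriving a distance field map $\mathbf{D}$ from the frontier pixel mask $\mathbf{F}$. As a rough sketch of that idea, the code below computes a per-pixel distance to the nearest frontier pixel with a multi-source breadth-first search. The city-block (4-connected) metric and the function name `distance_field` are assumptions made here; the caption does not state which distance definition or normalization is used.

```python
from collections import deque
from typing import List


def distance_field(mask: List[List[int]]) -> List[List[float]]:
    """City-block distance from each pixel to the nearest nonzero mask pixel.

    Pixels stay at infinity if the mask contains no frontier pixel at all.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    dist = [[float("inf")] * width for _ in range(height)]
    queue = deque()

    # Seed the search with every frontier pixel at distance zero.
    for r in range(height):
        for c in range(width):
            if mask[r][c]:
                dist[r][c] = 0.0
                queue.append((r, c))

    # Expand outwards one 4-connected step at a time.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < height and 0 <= nc < width:
                if dist[nr][nc] > dist[r][c] + 1.0:
                    dist[nr][nc] = dist[r][c] + 1.0
                    queue.append((nr, nc))
    return dist


# Example: a single frontier pixel in the centre of a 3x3 image.
F = [[0, 0, 0],
     [0, 1, 0],
     [0, 0, 0]]
D = distance_field(F)
```

A dense target of this kind gives a smoother supervision signal than a thin binary mask, which is one plausible reason to regress a distance field; a Euclidean distance transform would be a natural alternative to the metric used in this sketch.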
