OpenFrontier: General Navigation with Visual-Language Grounded Frontiers

Esteban Padilla, Boyang Sun, Marc Pollefeys, Hermann Blum

TL;DR

This work formulates navigation as a sparse subgoal identification and reaching problem and observes that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. It proposes OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision--language prior models.

Abstract

Open-world navigation requires robots to make decisions in complex everyday environments while adapting to flexible task requirements. Conventional navigation approaches often rely on dense 3D reconstruction and hand-crafted goal metrics, which limits their generalization across tasks and environments. Recent advances in vision--language navigation (VLN) and vision--language--action (VLA) models enable end-to-end policies conditioned on natural language, but typically require interactive training, large-scale data collection, or task-specific fine-tuning with a mobile agent. We formulate navigation as a sparse subgoal identification and reaching problem and observe that providing visual anchoring targets for high-level semantic priors enables highly efficient goal-conditioned navigation. Based on this insight, we select navigation frontiers as semantic anchors and propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision--language prior models. OpenFrontier enables efficient navigation with a lightweight system design, without dense 3D mapping, policy training, or model fine-tuning. We evaluate OpenFrontier across multiple navigation benchmarks and demonstrate strong zero-shot performance, as well as effective real-world deployment on a mobile robot.

Paper Structure

This paper contains 29 sections, 3 equations, 14 figures, 4 tables, 4 algorithms.

Figures (14)

  • Figure 1: System Overview. Given a posed RGB observation and a natural-language navigation goal, OpenFrontier detects visual frontiers in the image and directly queries a vision--language model to evaluate their relevance using in-image context. The resulting frontiers are then lifted into the 3D metric space with the updated information gain as goal-conditioned candidates and globally managed to update navigation targets, perform path planning, and determine termination.
  • Figure 2: The detected 2D frontier clusters are jointly queried with the corresponding RGB image using a set-of-marks prompting strategy. Each frontier is marked in the image, enabling the VLM to evaluate its relevance to the given navigation instruction within the local visual context. The resulting relevance probabilities are used to re-weight each frontier's exploration-driven information gain, effectively integrating task-specific semantic priors with exploration.
  • Figure 3: Navigation Results across two representative baseline methods and OpenFrontier on an HM3D validation scene with the goal of finding a bed. The red square and shaded region indicate the ground-truth target location and its success region. OpenFrontier makes more efficient decisions at multi-choice intersections, navigating directly toward the bedroom while avoiding redundant exploration of irrelevant areas.
  • Figure 4: Additional Navigation Examples. Top: OVON scenes with goals (left to right) refrigerator, picture, and dishwasher. Middle: MP3D scenes with goals stool, table, and cushion. Bottom: HM3D scenes with goals sofa, toilet, and bed. All experiments are conducted using the same system configuration and parameters across datasets.
  • Figure 5: OpenFrontier Navigation with Different Goal Contexts. Top: target is "plant in the bathroom." Bottom: target is "plant." The robot is initialized at the same starting location in both runs. From left to right, we show selected frames along the navigation trajectory together with the corresponding image observations overlaid with detected frontiers. The last frame shows the final image observation; the run terminates once the target is in the robot's field of view. Despite observing similar frontier locations, OpenFrontier assigns different relevance probabilities depending on the goal context. As a result, the top trajectory prioritizes regions likely associated with the bathroom, while the bottom trajectory moves toward the living room, which also commonly contains plants.
  • ...and 9 more figures
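
The re-weighting step described in the Figure 2 caption can be illustrated with a small sketch. The captions do not give the exact combination rule, so the multiplicative form below (score = relevance probability × information gain) is an assumption, and the `Frontier` class, its field names, and the numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Frontier:
    """Hypothetical container for one detected frontier."""
    mark_id: int      # label drawn on the image for set-of-marks prompting
    info_gain: float  # exploration-driven information gain (assumed >= 0)
    relevance: float  # VLM relevance probability in [0, 1]


def reweight(frontiers):
    """Return (mark_id, score) pairs sorted by descending score.

    Assumption: score = relevance * info_gain. The paper may combine
    the two quantities differently.
    """
    scored = [(f.mark_id, f.relevance * f.info_gain) for f in frontiers]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative values only: frontier 1 opens the most unexplored space,
# but frontier 2 is judged far more relevant to the goal.
frontiers = [
    Frontier(mark_id=1, info_gain=4.0, relevance=0.10),
    Frontier(mark_id=2, info_gain=2.0, relevance=0.75),
    Frontier(mark_id=3, info_gain=3.0, relevance=0.25),
]
ranking = reweight(frontiers)
best_id = ranking[0][0]  # frontier chosen as the next navigation target
```

Under this assumed rule, a frontier with low information gain can still outrank a more exploratory one when the VLM considers it much more relevant to the goal, which matches the goal-dependent behavior described in the Figure 5 caption.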