R2F: Repurposing Ray Frontiers for LLM-free Object Navigation

Francesco Argenziano, John Mark Alexis Marcelo, Michele Brienza, Abdel Hakim Drid, Emanuele Musumeci, Daniele Nardi, Domenico D. Bloisi, Vincenzo Suriani

TL;DR

This work repurposes ray frontiers (R2F), a recently proposed frontier-based exploration paradigm, to develop an LLM-free framework for indoor open-vocabulary object navigation, and introduces R2F-VLN, a lightweight extension for free-form language instructions using syntactic parsing and relational verification without additional VLM or LLM components.

Abstract

Zero-shot open-vocabulary object navigation has progressed rapidly with the emergence of large Vision-Language Models (VLMs) and Large Language Models (LLMs), now widely used as high-level decision-makers instead of end-to-end policies. Although effective, such systems often rely on iterative large-model queries at inference time, introducing latency and computational overhead that limit real-time deployment. To address this problem, we repurpose ray frontiers (R2F), a recently proposed frontier-based exploration paradigm, to develop an LLM-free framework for indoor open-vocabulary object navigation. While ray frontiers were originally used to bias exploration using semantic cues carried along rays, we reinterpret frontier regions as explicit, direction-conditioned semantic hypotheses that serve as navigation goals. Language-aligned features accumulated along out-of-range rays are stored sparsely at frontiers, where each region maintains multiple directional embeddings encoding plausible unseen content. Navigation then reduces to embedding-based frontier scoring and goal tracking within a classical mapping and planning pipeline, eliminating iterative large-model reasoning. We further introduce R2F-VLN, a lightweight extension for free-form language instructions using syntactic parsing and relational verification without additional VLM or LLM components. Experiments in Habitat-sim and on a real robotic platform demonstrate zero-shot performance competitive with the state of the art at real-time execution speeds, running up to 6 times faster than VLM-based alternatives.
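The embedding-based frontier scoring described in the abstract can be illustrated with a minimal sketch. It assumes each frontier holds a list of directional embeddings and that a frontier's score is the maximum cosine similarity between any of them and the query embedding; the aggregation rule, the function names (`cosine`, `score_frontier`, `select_frontier`), and the toy 3-D vectors are illustrative assumptions, not details taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def score_frontier(directional_embeddings, query):
    """Score one frontier by its best-matching direction.

    Assumption: max over directions; the paper may aggregate differently.
    """
    return max((cosine(e, query) for e in directional_embeddings),
               default=float("-inf"))

def select_frontier(frontiers, query):
    """Return (frontier_id, score) of the highest-scoring frontier, or None if empty.

    `frontiers` maps a hypothetical frontier id to its list of directional embeddings.
    """
    if not frontiers:
        return None
    scores = {fid: score_frontier(embs, query) for fid, embs in frontiers.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy example: 3-D vectors stand in for high-dimensional language-aligned features.
query = [1.0, 0.0, 0.0]
frontiers = {
    "hallway": [[0.0, 1.0, 0.0], [0.2, 0.9, 0.1]],
    "kitchen": [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]],
}
best_id, best_score = select_frontier(frontiers, query)
print(best_id, round(best_score, 3))
```

In this toy setup the "kitchen" frontier wins because one of its directions aligns closely with the query. Selecting a goal this way is a handful of dot products per frontier, with no large-model call in the loop.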

Paper Structure

This paper contains 16 sections, 6 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: R2F at a glance. A representative zero-shot open-vocabulary object navigation episode in a photorealistic indoor scene. As the agent explores, open-vocabulary semantic evidence is accumulated along out-of-range rays and attached to frontier regions (visualized as cosine-similarity heatmaps). Frontier scores increase in directions consistent with the target query, enabling embedding-based subgoal selection without iterative LLM/VLM deliberation. The episode terminates when the target is confidently detected (Goal found!). Code and supplementary material are available at https://lab-rococo-sapienza.github.io/r2f/.
  • Figure 2: R2F system overview and execution schedule. The agent receives RGB-D observations and a text query. RGB images are processed by RADIO with a Neighborhood-Aware attention modification (NA) to produce dense open-vocabulary features (NA-RADIO), while the query is encoded with SigLIP. From out-of-range depth pixels, semantic rays are sampled, binned by direction, and associated with frontier regions, forming Semantic Ray Frontiers that accumulate directional semantic evidence. In parallel, depth updates a volumetric occupancy map, from which frontier regions are periodically recomputed (low frequency), while semantic ray accumulation runs continuously (high frequency). Frontier regions are scored via cosine similarity with the query embedding, and the navigation policy (R2F or R2F-VLN) selects and tracks the highest-scoring frontier until the goal is detected or exploration continues.
  • Figure 3: NA-RADIO feature maps compared against different text queries. Both the text query embedding and the visual features lie in the SigLIP (Zhai et al., 2023) embedding space thanks to RADIO's adapter. In practice, we observe that informative similarity values typically lie in the range $[-0.10, 0.15]$.
  • Figure 4: Pipeline execution on a real robot for the "Find a sink" goal. The robot navigates between frontiers according to their semantic scores until it reaches the goal.
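The direction binning and accumulation step from the Figure 2 caption can be sketched as follows. This is a simplified illustration under stated assumptions: rays are binned by 2-D azimuth only, each bin keeps a running mean of the features it receives, and the ray-to-frontier association is assumed to have been done already. The class `SemanticRayFrontier`, its methods, and the bin count are hypothetical names and values, not the authors' implementation.

```python
import math

class SemanticRayFrontier:
    """One frontier region holding sparse, per-direction feature accumulators."""

    def __init__(self, num_bins, dim):
        self.num_bins = num_bins  # number of azimuth bins (assumed value)
        self.dim = dim            # feature dimensionality
        self.sums = {}            # bin index -> summed feature vector
        self.counts = {}          # bin index -> number of rays accumulated

    def bin_index(self, dx, dy):
        """Map a ray direction (dx, dy) to an azimuth bin in [0, num_bins)."""
        azimuth = math.atan2(dy, dx) % (2.0 * math.pi)
        idx = int(azimuth / (2.0 * math.pi) * self.num_bins)
        # Clamp: floating-point rounding can push the azimuth to exactly 2*pi.
        return min(idx, self.num_bins - 1)

    def add_ray(self, dx, dy, feature):
        """Accumulate the feature of one out-of-range ray into its direction bin."""
        b = self.bin_index(dx, dy)
        acc = self.sums.setdefault(b, [0.0] * self.dim)
        for i, value in enumerate(feature):
            acc[i] += value
        self.counts[b] = self.counts.get(b, 0) + 1

    def embeddings(self):
        """Return the mean feature per populated bin; empty bins are not stored."""
        return {b: [v / self.counts[b] for v in s] for b, s in self.sums.items()}

# Toy example with 8 azimuth bins and 2-D features.
frontier = SemanticRayFrontier(num_bins=8, dim=2)
frontier.add_ray(1.0, 0.0, [1.0, 0.0])    # azimuth 0 rad, lands in bin 0
frontier.add_ray(1.0, 0.1, [0.0, 1.0])    # azimuth about 0.1 rad, also bin 0
frontier.add_ray(-0.1, 1.0, [0.0, 2.0])   # just past 90 degrees, lands in bin 2
directional = frontier.embeddings()
print(directional)
```

Storing only populated bins keeps the representation sparse, and the per-bin means are the directional embeddings that would then be scored against the query by cosine similarity.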