WildOS: Open-Vocabulary Object Search in the Wild

Hardik Shah, Erica Tevere, Deegan Atha, Marcel Kaufmann, Shehryar Khattak, Manthan Patel, Marco Hutter, Jonas Frey, Patrick Spieler

TL;DR

This work presents WildOS, a unified system for long-range, open-vocabulary object search that combines safe geometric exploration with semantic visual reasoning and introduces a particle-filter-based method for coarse localization of the open-vocabulary target query, enabling effective planning toward distant goals.

Abstract

Autonomous navigation in complex, unstructured outdoor environments requires robots to operate over long ranges without prior maps and with limited depth sensing. In such settings, relying solely on geometric frontiers for exploration is often insufficient; the ability to reason semantically about where to go and what is safe to traverse is crucial for robust, efficient exploration. This work presents WildOS, a unified system for long-range, open-vocabulary object search that combines safe geometric exploration with semantic visual reasoning. WildOS builds a sparse navigation graph to maintain spatial memory, while utilizing a foundation-model-based vision module, ExploRFM, to score frontier nodes of the graph. ExploRFM simultaneously predicts traversability, visual frontiers, and object similarity in image space, enabling real-time, onboard semantic navigation tasks. The resulting vision-scored graph enables the robot to explore semantically meaningful directions while ensuring geometric safety. Furthermore, we introduce a particle-filter-based method for coarse localization of the open-vocabulary target query that estimates candidate goal positions beyond the robot's immediate depth horizon, enabling effective planning toward distant goals. Extensive closed-loop field experiments across diverse off-road and urban terrains demonstrate that WildOS enables robust navigation, significantly outperforming purely geometric and purely vision-based baselines in both efficiency and autonomy. Our results highlight the potential of vision foundation models to drive open-world robotic behaviors that are both semantically informed and geometrically grounded. Project Page: https://leggedrobotics.github.io/wildos/

Paper Structure (75 sections, 32 equations, 16 figures, 2 tables, 9 algorithms)

Figures (16)

  • Figure 1: (a) WildOS enables autonomous semantic navigation in diverse unstructured outdoor environments. (b, c) Due to the limited range of geometric sensing, robots can only reliably perceive nearby regions within a depth horizon (blue), leading to myopic exploration and (e) difficulty localizing distant targets (e.g., a “house”) beyond sensing range. (b, c) Conventional exploration (dashed path) relies on geometric frontiers (blue dots) at the boundary between known and unknown space, which ignores long-range semantic and traversability cues. WildOS (green path) augments geometric exploration with long-range visual reasoning using a vision foundation model, defining a visual horizon (red) that extends beyond depth sensing and predicts visual traversability, visual frontiers (red dots), and open-vocabulary object similarity in image space. (d) During deployment, a sparse navigation graph is built from geometry and frontier nodes are scored using vision, while a particle-filter-based goal localization module (yellow particles) estimates candidate goal locations beyond the depth horizon, enabling safe, efficient planning toward distant semantic goals.
  • Figure 2: Method Overview consisting of five main components: 1) WildOS incrementally builds a sparse navigation graph from geometric sensing to maintain persistent spatial memory and identify geometric frontier nodes for safe exploration (Sec. \ref{sec:navigation_graph}). 2) To reason beyond the limited depth horizon, a learned vision-language module, ExploRFM, processes the current image and text query to predict visual traversability, visual frontiers, and open-vocabulary object similarity over a long-range visual horizon (Sec. \ref{sec:explorfm}). 3) Object detections from multiple viewpoints are fused by a probabilistic goal triangulation module to estimate a coarse 3D target location beyond direct sensor range (Sec. \ref{sec:triangulation}). 4) Geometric frontier nodes are then projected into the image and scored using the visual-semantic cues and the current goal estimate, producing a semantically scored navigation graph (Sec. \ref{sec:scored_graph}). 5) Finally, a hierarchical planner selects and executes actions by planning over the scored graph and generating locally safe motions toward intermediate goals (Sec. \ref{sec:planner}).
  • Figure 3: Navigation Graph Construction. (a) Nodes are sampled in free cells of $\mathcal{T}^{\text{geo}}_t$ and assigned a free radius $r^f_i$; invalid samples are shown in red. (b) Explored radius $r^e_i$ for each node. (c) Frontier cells are identified and assigned to nodes to obtain $\mathcal{F}^{\text{geo}}_t$. (d) $r_{\text{edge}}$ for the current node. (e) Dense graph obtained after the first iteration of graph construction. (f) The robot pose $\mathbf{x}_t$ is updated and $\mathcal{T}^{\text{geo}}_t$ shifts accordingly. (g) New nodes are sampled and $r^f_i, r^e_i$ are updated. (h) New frontier cells are detected and invalid ones are removed; invalid cells, shown in red, are those that lie in known regions or within the explored radius of another node. (i) Frontier cells are assigned to nodes and the frontier nodes are updated.
  • Figure 4: ExploRFM Architecture. ExploRFM builds upon the RADIO vision foundation model to jointly reason about traversability, semantic frontiers, and goal-object localization from a single RGB frame and language query.
  • Figure 5: Coarse goal query localization via triangulation with ray-distance weighting. Darker particles carry a higher weight.
  • ...and 11 more figures
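
The ray-distance weighting named in the Figure 5 caption can be illustrated with a small sketch. This is not the authors' implementation: the 2D setting, the Gaussian kernel, the bandwidth `sigma`, the uniform particle initialization, and all function names below are assumptions made for illustration. Each detection is treated as a bearing ray from the robot position, and particles closer to the ray receive a higher weight; fusing rays from several viewpoints concentrates the weight near their intersection.

```python
import numpy as np

def point_to_ray_distance(points, origin, direction):
    """Distance from each 2D point to the half-line origin + s * direction, s >= 0."""
    d = direction / np.linalg.norm(direction)
    rel = points - origin                      # (N, 2) offsets from the ray origin
    s = np.clip(rel @ d, 0.0, None)            # clamp so points behind the origin measure to the origin
    closest = origin + s[:, None] * d          # closest point on the ray for each particle
    return np.linalg.norm(points - closest, axis=1)

def update_weights(particles, weights, origin, bearing, sigma=2.0):
    """Reweight particles with a Gaussian kernel on ray distance (assumed kernel)."""
    direction = np.array([np.cos(bearing), np.sin(bearing)])
    dist = point_to_ray_distance(particles, origin, direction)
    new_weights = weights * np.exp(-0.5 * (dist / sigma) ** 2)
    total = new_weights.sum()
    if total < 1e-12:                          # degenerate case: fall back to uniform weights
        return np.full_like(weights, 1.0 / len(weights))
    return new_weights / total

def estimate_goal(particles, weights):
    """Weighted mean of the particle positions as a coarse goal estimate."""
    return weights @ particles

# Hypothetical scenario: a target observed from three robot positions.
rng = np.random.default_rng(0)
num_particles = 20000
particles = rng.uniform(-50.0, 50.0, size=(num_particles, 2))
weights = np.full(num_particles, 1.0 / num_particles)
target = np.array([30.0, 20.0])

viewpoints = [np.array([0.0, 0.0]), np.array([40.0, -10.0]), np.array([0.0, 40.0])]
for origin in viewpoints:
    dx, dy = target - origin
    bearing = np.arctan2(dy, dx)               # noise-free bearing, for illustration only
    weights = update_weights(particles, weights, origin, bearing)

estimate = estimate_goal(particles, weights)
print("coarse goal estimate:", estimate)
```

A single ray only constrains the goal to a corridor, so the estimate becomes useful once rays from sufficiently different viewpoints overlap. A full particle filter would also resample and add noise between updates, which this sketch omits.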