
OVerSeeC: Open-Vocabulary Costmap Generation from Satellite Images and Natural Language

Rwik Rana, Jesse Quattrociocchi, Dongmyeong Lee, Christian Ellis, Amanda Adkins, Adam Uccello, Garrett Warnell, Joydeep Biswas

TL;DR

OVerSeeC is a zero-shot modular framework for generating global costmaps for long-range planning directly from satellite imagery when entities and mission-specific traversal rules are expressed in natural language at test time. It shows that modular composition of foundation models enables open-vocabulary, preference-aligned costmap generation for scalable, mission-adaptive global planning.

Abstract

Aerial imagery provides essential global context for autonomous navigation, enabling route planning at scales inaccessible to onboard sensing. We address the problem of generating global costmaps for long-range planning directly from satellite imagery when entities and mission-specific traversal rules are expressed in natural language at test time. This setting is challenging since mission requirements vary, terrain entities may be unknown at deployment, and user prompts often encode compositional traversal logic. Existing approaches relying on fixed ontologies and static cost mappings cannot accommodate such flexibility. While foundation models excel at language interpretation and open-vocabulary perception, no single model can simultaneously parse nuanced mission directives, locate arbitrary entities in large-scale imagery, and synthesize them into an executable cost function for planners. We therefore propose OVerSeeC, a zero-shot modular framework that decomposes the problem into Interpret-Locate-Synthesize: (i) an LLM extracts entities and ranked preferences, (ii) an open-vocabulary segmentation pipeline identifies these entities from high-resolution imagery, and (iii) the LLM uses the user's natural language preferences and masks to synthesize executable costmap code. Empirically, OVerSeeC handles novel entities, respects ranked and compositional preferences, and produces routes consistent with human-drawn trajectories across diverse regions, demonstrating robustness to distribution shifts. This shows that modular composition of foundation models enables open-vocabulary, preference-aligned costmap generation for scalable, mission-adaptive global planning.
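
As a rough illustration of the Synthesize step, the sketch below shows the kind of costmap function that could be composed from per-class masks and ranked preferences. It is not the code OVerSeeC generates. The names `compose_costmap` and `default_cost`, the rank-to-cost normalization, the rule that overlapping masks resolve to the costliest class, and the toy masks are all assumptions made for this example.

```python
from typing import Dict, List

Mask = List[List[bool]]
Costmap = List[List[float]]


def compose_costmap(masks: Dict[str, Mask], ranks: Dict[str, int],
                    default_cost: float = 1.0) -> Costmap:
    """Turn per-class masks and preference ranks into a grid of costs in (0, 1]."""
    max_rank = max(ranks.values())
    first = next(iter(masks.values()))
    height, width = len(first), len(first[0])
    # Assumption: cells covered by no mask receive default_cost.
    costmap = [[default_cost] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            # Normalized costs of every class whose mask covers this cell.
            hits = [ranks[name] / max_rank
                    for name, mask in masks.items() if mask[r][c]]
            if hits:
                # Assumption: overlapping masks resolve to the costliest class.
                costmap[r][c] = max(hits)
    return costmap


# Toy 2x3 scene; a lower rank means a more preferred class.
toy_ranks = {"road": 1, "grass": 2, "building": 5}
toy_masks = {
    "road": [[True, True, False], [False, False, False]],
    "grass": [[False, True, True], [False, False, False]],
    "building": [[False, False, False], [True, False, False]],
}
toy_costmap = compose_costmap(toy_masks, toy_ranks)
```

In this toy scene the cell covered by both road and grass takes the grass cost, and the two cells with no mask fall back to `default_cost`. A real mission prompt could call for a different overlap or fallback policy.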

Paper Structure (29 sections, 3 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: Overview of OVerSeeC, which uses a satellite image $I$ and a natural language prompt $\mathcal{P}$ to generate a preference-aligned costmap $C$ for global planning. The Entity Identifier and Costmap Function Compositor (Sec. \ref{sec:class_extractor}, \ref{sec:llm_code}) use an LLM to extract relevant terrain classes $\mathcal{C}$ and synthesize a cost function $f_{\text{LLM}}(\cdot)$, respectively. The Open-Vocabulary Mask Generator (Sec. \ref{sec:semseg_module}, \ref{sec:mask_refine}) performs zero-shot semantic segmentation over $I$, yielding class masks $\{ \widehat{M}_c \}$ and thresholded probability maps $\{ \widehat{P}^{\tau}_c \}$, where $c \in \mathcal{C}$. Finally, $f_{\text{LLM}}(\cdot)$ is executed to generate the final costmap $C$.
  • Figure 2: Open-Vocabulary Mask Generator (Sec. \ref{sec:ovmg}). Given a satellite image $I$ and extracted classes $\mathcal{C}$, the pipeline comprises two submodules: (i) Open-Vocabulary Semantic Segmentation (Sec. \ref{sec:semseg_module}), which produces per-class probability maps $P_c$ and coarse masks $M_c$ for open-ontology classes; and (ii) Mask Refinement (Sec. \ref{sec:mask_refine}), which refines them into fine probabilities $\widehat{P}_c$ and masks $\widehat{M}_c$.
  • Figure 3: Planning results for the $\mathcal{D}_2\texttt{-OOD-OV}$ scenario: comparison of costmap alignment using the RRPI metric (Sec. \ref{sec:rrpi_metric_definition}) under the user preference: “Prefer the roads and trails, grass should be fine, try to avoid the baseball field as much as possible.” The class rankings used are: road: 1, trail: 1, grass: 2, baseball field: 3, tree: 4, building: 5. The top row shows scatter plots of RRPI vs. path length with KDE contours; the colored pointers in these plots indicate the COM of the KDE, and the solid line represents a linear regression fit. A lower slope for this line is preferable, as it indicates that the RRPI score remains low even as path length increases. The bottom row shows a subset of these trajectories, generated with Dijkstra's algorithm and overlaid on the map (start: arrow, goal: star).
  • Figure 5: RQ2: Qualitative results from the human case study experiments, all in open-vocabulary settings (see Table \ref{tab:rq2_rrpi_ov}). Each example corresponds to the scenarios in Table \ref{tab:rq1_human_align}, showing that OVerSeeC adapts to novel categories and contextual prompt semantics. Across these OV examples, the trajectories produced by OVerSeeC are qualitatively closest to the human-drawn references, demonstrating strong alignment with operator intent.
  • Figure 6: RQ3: OVerSeeC's segmentation output provides a reliable foundation for planning.
  • ...and 4 more figures
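
The Figure 3 caption mentions trajectories generated with Dijkstra's algorithm over the costmap. A minimal sketch of that planning step on a small grid follows. The 4-connected neighborhood, the convention that entering a cell costs that cell's value, the function name `dijkstra_path`, and the toy grid are all assumptions for illustration, not details taken from the paper.

```python
import heapq
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def dijkstra_path(costmap: List[List[float]], start: Cell, goal: Cell) -> List[Cell]:
    """Return a minimum-cost 4-connected path from start to goal, or [] if none exists."""
    height, width = len(costmap), len(costmap[0])
    dist: Dict[Cell, float] = {start: 0.0}
    parent: Dict[Cell, Cell] = {}
    frontier = [(0.0, start)]
    while frontier:
        d, cell = heapq.heappop(frontier)
        if cell == goal:
            break
        if d > dist.get(cell, float("inf")):
            continue  # Stale queue entry; a cheaper route was already found.
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < height and 0 <= nc < width:
                # Assumption: moving into a cell costs that cell's costmap value.
                nd = d + costmap[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    parent[(nr, nc)] = cell
                    heapq.heappush(frontier, (nd, (nr, nc)))
    if goal not in dist:
        return []
    # Walk parent links back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    path.reverse()
    return path


# Toy costmap: a cheap corridor (0.1) wraps around two expensive cells (1.0).
toy_grid = [
    [0.1, 0.1, 0.1],
    [1.0, 1.0, 0.1],
    [0.1, 0.1, 0.1],
]
toy_route = dijkstra_path(toy_grid, (0, 0), (2, 0))
```

Here the planner takes the longer corridor around the expensive cells instead of the two-step direct route, which is the behavior a preference-aligned costmap is meant to induce.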