GOMAA-Geo: GOal Modality Agnostic Active Geo-localization

Anindya Sarkar; Srikumar Sastry; Aleksis Pirinen; Chongjie Zhang; Nathan Jacobs; Yevgeniy Vorobeychik

TL;DR

This work tackles active geo-localization when the goal may be specified in any of several modalities (text, ground-level imagery, or aerial imagery). It introduces GOMAA-Geo, a goal modality agnostic agent formulated as a goal-conditioned POMDP (GC-POMDP) over a discretized grid. The method combines cross-modal alignment via CLIP-based embeddings (CLIP-MMFE), history-aware planning through a goal-aware LLM pretraining scheme (GASP), and reinforcement learning with PPO, enabling zero-shot generalization to unseen goal modalities and disaster scenarios. The authors evaluate on the Masa dataset and a new MM-GAG dataset, reporting substantial success ratio (SR) gains over strong baselines and robust zero-shot transfer to xBD disaster data, which suggests practical potential for search and rescue (SAR) and environmental monitoring. Limitations include grid-based, planar navigation and the need for further study of a STOP action; future work aims to extend the approach to continuous spaces and real-world UAV systems while further refining planning under uncertainty.
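To make the discretized-grid search setting concrete, here is a minimal Python sketch of one episode and of the success ratio. It is an illustration under stated assumptions, not the authors' environment: the class `GridSearchEpisode`, the four-move action set, the boundary clipping, and the `policy(pos)` interface are all hypothetical. In the paper's setting the agent does not see coordinates; it receives aerial glimpses and a goal description.

```python
from dataclasses import dataclass

# Hypothetical action set; the source only states that the area is a discretized grid.
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


@dataclass
class GridSearchEpisode:
    grid_size: int   # assumed side length of a square grid
    start: tuple     # (row, col) of the initial cell
    goal: tuple      # (row, col) of the goal cell
    budget: int      # maximum number of moves (search budget)

    def __post_init__(self):
        self.pos = self.start
        self.steps = 0

    def step(self, action):
        """Move one cell, clipping at the grid boundary (an assumption)."""
        dr, dc = ACTIONS[action]
        row = min(max(self.pos[0] + dr, 0), self.grid_size - 1)
        col = min(max(self.pos[1] + dc, 0), self.grid_size - 1)
        self.pos = (row, col)
        self.steps += 1
        return self.pos, self.pos == self.goal


def run_episode(episode, policy):
    """Roll out a policy until the goal is reached or the budget is spent."""
    found = episode.pos == episode.goal
    while not found and episode.steps < episode.budget:
        _, found = episode.step(policy(episode.pos))
    return found


def success_ratio(outcomes):
    """Fraction of episodes in which the goal was reached within the budget."""
    return sum(outcomes) / len(outcomes)
```

A learned policy would replace the `policy` callable; the budget check is what makes efficient use of each observation matter.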

Abstract

We consider the task of active geo-localization (AGL) in which an agent uses a sequence of visual cues observed during aerial navigation to find a target specified through multiple possible modalities. This could emulate a UAV involved in a search-and-rescue operation navigating through an area, observing a stream of aerial images as it goes. The AGL task is associated with two important challenges. First, the agent must deal with a goal specification in one of multiple modalities (e.g., a natural language description) while the search cues are provided in another modality (aerial imagery). Second, localization time is limited (e.g., by battery life or urgency), so the goal must be localized as efficiently as possible; i.e., the agent must effectively leverage its sequentially observed aerial views when searching for the goal. To address these challenges, we propose GOMAA-Geo, a goal modality agnostic active geo-localization agent, for zero-shot generalization between different goal modalities. Our approach combines cross-modality contrastive learning to align representations across modalities with supervised foundation model pretraining and reinforcement learning to obtain highly effective navigation and localization policies. Through extensive evaluations, we show that GOMAA-Geo outperforms alternative learnable approaches and that it generalizes across datasets (e.g., to disaster-hit areas without seeing a single disaster scenario during training) and across goal modalities (e.g., to ground-level imagery or textual descriptions, despite only being trained with goals specified as aerial views). Code and models are publicly available at https://github.com/mvrl/GOMAA-Geo/tree/main.
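As a rough illustration of what cross-modality contrastive alignment can look like, the NumPy sketch below computes a generic CLIP-style symmetric contrastive loss over a batch of paired embeddings. The abstract does not specify the loss used, so the function `symmetric_contrastive_loss`, the temperature value, and the batch layout (row i of each array describes the same location) are assumptions made for this example.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Generic symmetric contrastive loss between two modalities.

    emb_a, emb_b: arrays of shape (batch, dim); row i in each is a matched pair.
    temperature: assumed value for illustration, not taken from the paper.
    """
    a = l2_normalize(np.asarray(emb_a, dtype=float))
    b = l2_normalize(np.asarray(emb_b, dtype=float))
    logits = a @ b.T / temperature
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T))
```

Minimizing a loss of this kind pulls matched pairs together and pushes mismatched pairs apart, which is what would let a policy trained on aerial-view goals accept goals embedded from other modalities.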

Paper Structure (27 sections, 6 equations, 17 figures, 15 tables)

Introduction
Related Work
Active Geo-localization Setup
Proposed Framework for Goal Modality Agnostic Active Geo-localization
Experiments and Results
Further Analyses and Ablation Studies
Conclusions
Effectiveness of the Proposed Dense Reward Function
More Visualizations of Exploration Behavior of GOMAA-Geo across different Goal Modalities
Evaluation of GOMAA-Geo across Different Grid Sizes
Trade-off: Modality-specific vs. Modality-invariant Goal Representation in Active Geolocalization
More Visualizations of Exploration Behavior of GOMAA-Geo
More Qualitative Evaluation and Zero-Shot generalizability of GASP
Details of the RPG-GOMAA Framework
Performance Comparison of GOMAA-Geo with Varying Search Budget $\mathcal{B}$
...and 12 more sections

Figures (17)

Figure 1: Active geo-localization across different goal modalities. The agent must navigate to the goal (yellow dot) based on partial aerial glimpses, i.e. the full area is never observed in its entirety.
Figure 2: GASP strategy for pretraining LLMs for AGL.
Figure 3: Our proposed GOMAA-Geo framework for active geo-localization.
Figure 4: Example exploration behavior of GOMAA-Geo across different goal modalities. The stochastic policy selects actions probabilistically, whereas the argmax policy selects the action with the highest probability.
Figure 5: Examples of successful exploration behaviors of GOMAA-Geo.
...and 12 more figures
