GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery

Fengxiang Wang; Mingshuo Chen; Yueying Li; Yajie Yang; Yifan Zhang; Long Lan; Xue Yang; Hongda Sun; Yulin Wang; Di Wang; Jun Song; Jing Zhang; Bo Du

TL;DR

GeoEyes is a staged training framework that combines a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), with an agentic reinforcement learning method, AdaZoom-GRPO, which explicitly rewards evidence gain and answer improvement during zoom interactions. It achieves substantial improvements on UHR remote sensing benchmarks.

Abstract

The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substantial improvements on UHR remote sensing benchmarks, with 54.23% accuracy on XLRS-Bench.

Paper Structure (15 sections, 3 equations, 5 figures, 11 tables)

Introduction
Related Work
Dataset
Method
Preliminaries: Zoom-in Agentic RL Paradigm
Our Approach
Cold-start SFT
AdaZoom-GRPO
Experiments
Experimental Setup
Main Results
Ablation Study
Conclusion
Data Construction Pipeline
Detailed Analysis of the Ablation Study

Figures (5)

Figure 1: Illustration of the "Tool Usage Homogenization" phenomenon in UHR RS benchmarks. Domain Data means SuperRS-VQA geollava. Tool Usage Homogenization: a collapse to a near-constant one-call tool pattern across samples. Avg. Tool Usage means tool-call depth over samples that invoke the tool. Our GeoEyes triggers the tool on 68.44% of evaluation samples, compared with 100% for DeepEyes and its domain-augmented variant.
Figure 2: GeoEyes enables task-adaptive tool use for UHR remote-sensing reasoning, from tool-free inference to multi-round progressive zooming. Specifically, we introduce UHR-CoZ, an interleaved image–text CoT dataset, and AdaZoom-GRPO, a tailored RL method that trains GeoEyes to use tools for evidence gain. GeoEyes significantly outperforms state-of-the-art closed- and open-source baselines.
Figure 3: Automated data construction pipeline. Stage 1 performs UHR-CoF annotation generation with multi-round zoom_in. Stage 2 applies quality control, including answer cleaning and trajectory cleaning. An example two-step zoom_in trajectory is shown to illustrate the multi-round generation process.
Figure 4: Overview of our method. We first perform cold-start SFT on UHR-CoZ, then apply AdaZoom-GRPO for RL. AdaZoom-GRPO mainly includes an Adaptive Efficiency reward for task heterogeneity, a Chain-of-Focus reward for low evidence density, and a Process Verification reward to enforce logical rigor.
Figure 5: UHR-CoZ example with three zoom-in calls. Left: the interleaved dialogue, where the model alternates between stepwise reasoning and tool_call requests that specify normalized bounding boxes. Right: the corresponding multi-scale views returned by the agent, from the global image (image_0) to three progressively localized crops (image_1--image_3). Red dashed boxes and arrows indicate the selected regions across rounds, illustrating iterative evidence acquisition for answering the question.
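
As a rough illustration of the zoom-in calls described in the Figure 5 caption, the sketch below converts a normalized bounding box into a pixel crop window on a UHR image. It is not the authors' implementation: the function name `normalized_box_to_pixels`, the `(x_min, y_min, x_max, y_max)` coordinate order, and the clamping and rounding choices are all assumptions made for illustration.

```python
import math
from typing import Tuple

# Hypothetical box layout (an assumption, not taken from the paper):
# (x_min, y_min, x_max, y_max), each value normalized to [0, 1].
Box = Tuple[float, float, float, float]


def normalized_box_to_pixels(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Map a normalized box to an integer pixel window (left, top, right, bottom).

    Coordinates are clamped to [0, 1] first, so a slightly out-of-range
    request still yields a window inside the image.
    """
    x0, y0, x1, y1 = (min(max(v, 0.0), 1.0) for v in box)
    if x1 <= x0 or y1 <= y0:
        raise ValueError("box must have positive width and height")

    # Round outward (floor the start, ceil the end) so the crop never
    # cuts off part of the requested region.
    left = int(math.floor(x0 * width))
    top = int(math.floor(y0 * height))
    right = max(left + 1, int(math.ceil(x1 * width)))
    bottom = max(top + 1, int(math.ceil(y1 * height)))
    return left, top, right, bottom


if __name__ == "__main__":
    # Zoom into the right-hand centre of an 8000 x 6000 image.
    print(normalized_box_to_pixels((0.25, 0.5, 0.75, 1.0), 8000, 6000))
```

Normalized coordinates keep a tool call independent of image resolution, so the same request format applies to the global view and to any later crop.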
