ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks

Ruixun Liu, Bowen Fu, Jiayi Song, Kaiyu Li, Wanchen Li, Lanxuan Xue, Hui Qiao, Weizhan Zhang, Deyu Meng, Xiangyong Cao

TL;DR

This work tackles the inefficiency of passive perception in ultra-high-resolution (UHR) remote sensing (RS) by proposing an active perception paradigm. It contributes LRS-GRO, a large-scale, multi-level VQA benchmark with region-of-interest (ROI) annotations, and ZoomEarth, a cropping–zooming RS foundation model trained with supervised fine-tuning and reinforcement learning using a Region-Guided reward. ZoomEarth achieves state-of-the-art results on LRS-GRO and strong zero-shot performance on public UHR RS benchmarks, and it offers a training-free toolkit for downstream tasks such as cloud removal, denoising, segmentation, and image editing. Together, active ROI localization, ROI-focused reasoning, and the toolkit make ZoomEarth a platform for building autonomous RS agents capable of efficient, fine-grained geospatial understanding.

Abstract

Ultra-high-resolution (UHR) remote sensing (RS) images offer rich fine-grained information but also present challenges in effective processing. Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm, suffering from increased redundancy when obtaining finer visual inputs. In this work, we explore a new active perception paradigm that enables models to revisit information-rich regions. First, we present LRS-GRO, a large-scale benchmark dataset tailored for active perception in UHR RS processing, encompassing 17 question types across global, region, and object levels, annotated via a semi-automatic pipeline. Building on LRS-GRO, we propose ZoomEarth, an adaptive cropping-zooming framework with a novel Region-Guided reward that provides fine-grained guidance. Trained via supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), ZoomEarth achieves state-of-the-art performance on LRS-GRO and, in the zero-shot setting, on three public UHR remote sensing benchmarks. Furthermore, ZoomEarth can be seamlessly integrated with downstream models for tasks such as cloud removal, denoising, segmentation, and image editing through simple tool interfaces, demonstrating strong versatility and extensibility.
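
The cropping-zooming idea described in the abstract can be illustrated with a minimal sketch. It assumes the model first sees a downsampled overview, predicts a bounding box on that overview, and then receives the same region cut from the full-resolution image. The helper names (`downsample`, `scale_box`, `crop_zoom`), the integer scale factor, and the nested-list image are hypothetical choices for illustration, not the paper's implementation.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) in pixels, half-open on the right/bottom edge.
Box = Tuple[int, int, int, int]
Image = List[List[int]]


def downsample(image: Image, factor: int) -> Image:
    """Build a coarse overview by keeping every `factor`-th pixel (assumed scheme)."""
    return [row[::factor] for row in image[::factor]]


def scale_box(box: Box, factor: int, width: int, height: int) -> Box:
    """Map a box from overview coordinates to full-resolution coordinates, clamped to the image."""
    x0, y0, x1, y1 = box
    return (
        max(0, x0 * factor),
        max(0, y0 * factor),
        min(width, x1 * factor),
        min(height, y1 * factor),
    )


def crop_zoom(image: Image, box_on_overview: Box, factor: int) -> Image:
    """Return the region selected on the overview, cut from the full-resolution image."""
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = scale_box(box_on_overview, factor, width, height)
    return [row[x0:x1] for row in image[y0:y1]]


# Toy 8x8 "image" whose pixel value encodes its position: 10 * row + column.
full = [[10 * r + c for c in range(8)] for r in range(8)]
overview = downsample(full, 2)          # 4x4 overview the model would look at first
roi = crop_zoom(full, (1, 1, 3, 2), 2)  # box chosen on the overview, restored at full detail
```

In this toy case the ROI holds four times as many pixels as the matching overview patch, which is the extra detail an active second look is meant to recover.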

Paper Structure

This paper contains 45 sections, 9 equations, 22 figures, 8 tables.

Figures (22)

  • Figure 1: Typical examples from our proposed benchmark LRS-GRO and results obtained by our ZoomEarth framework. LRS-GRO focuses on UHR RS imagery, covers 17 multimodal vision-language understanding categories, and emphasizes the active perception process.
  • Figure 2: Comparison between passive perception and active perception. (a) Dynamic Resolution [dong2024internlmxcomposer24khdpioneeringlargevisionlanguage, zhao2024dynrefer] and (b) Token Pruning [guo2025cropcontextualregionorientedvisual, Alvar_2025_CVPR] represent passive perception approaches with only a single image input. (c) We introduce a cropping–zooming based active perception method. "Zooming" refers to restoring the cropped region at its original high resolution.
  • Figure 3: The visualization of the training and evaluation pipeline of our proposed methods. The model architecture diagram in the upper-right corner demonstrates the model's ability to adaptively crop the ROI by generating the BBox, and subsequently perform advanced reasoning. For clarity in the illustration, we omit the input of query tokens.
  • Figure 4: (a) Construction pipeline of our proposed LRS-GRO dataset, in which manual filtering and refinement are performed after Step 3. (b) The upper chart shows the 17 question types in the LRS-GRO dataset, whereas the lower chart shows the distribution of typical answer categories, demonstrating the dataset's balance.
  • Figure 5: Comparison between $r_{IoU}$ and $r_{R-G}$.
  • ...and 17 more figures
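
Figure 5 contrasts an IoU-based reward with the Region-Guided reward. As background, the sketch below computes the standard intersection-over-union between a predicted and a reference bounding box. The function name `box_iou` and the `(x0, y0, x1, y1)` box convention are assumptions, and the paper's exact $r_{IoU}$ may differ. The Region-Guided reward is not defined in this summary, so it is not sketched here.

```python
from typing import Tuple

# (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1; assumed convention.
Box = Tuple[float, float, float, float]


def box_iou(pred: Box, target: Box) -> float:
    """Standard intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix0 = max(pred[0], target[0])
    iy0 = max(pred[1], target[1])
    ix1 = min(pred[2], target[2])
    iy1 = min(pred[3], target[3])

    # Negative extents mean the boxes do not overlap.
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)

    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_target = (target[2] - target[0]) * (target[3] - target[1])
    union = area_pred + area_target - inter

    # Guard against two degenerate (zero-area) boxes.
    return inter / union if union > 0 else 0.0


partial = box_iou((0, 0, 2, 2), (1, 1, 3, 3))   # overlap area 1, union area 7
disjoint = box_iou((0, 0, 1, 1), (2, 2, 3, 3))  # no overlap
```

A plain IoU signal is zero for every prediction that misses the reference box, however near the miss, which is one common reason to look for finer-grained localization rewards.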

Theorems & Definitions (1)

  • Definition 1: Active Perception Oriented RS VQA