ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks

Ruixun Liu; Bowen Fu; Jiayi Song; Kaiyu Li; Wanchen Li; Lanxuan Xue; Hui Qiao; Weizhan Zhang; Deyu Meng; Xiangyong Cao

ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks

Ruixun Liu, Bowen Fu, Jiayi Song, Kaiyu Li, Wanchen Li, Lanxuan Xue, Hui Qiao, Weizhan Zhang, Deyu Meng, Xiangyong Cao

TL;DR

This work tackles the inefficiency of passive perception in ultra-high-resolution remote sensing by introducing an active perception paradigm. It introduces LRS-GRO, a large-scale, multi-level VQA benchmark with ROI annotations, and ZoomEarth, a cropping–zooming RS foundation model trained with supervised fine-tuning and reinforcement learning using a Region-Guided reward. The approach achieves state-of-the-art results on LRS-GRO and strong zero-shot performance on public UHR RS benchmarks, while offering a training-free toolkit for downstream tasks such as cloud removal, denoising, segmentation, and image editing. The combination of active ROI localization, ROI-focused reasoning, and a scalable toolkit makes ZoomEarth a versatile platform for building autonomous RS agents capable of efficient, fine-grained geospatial understanding. This work thus advances both datasets and methods for active perception in RS and enables practical, extensible applications in real-world geospatial analysis.

Abstract

Ultra-high-resolution (UHR) remote sensing (RS) images offer rich fine-grained information but also present challenges in effective processing. Existing dynamic resolution and token pruning methods are constrained by a passive perception paradigm, suffering from increased redundancy when obtaining finer visual inputs. In this work, we explore a new active perception paradigm that enables models to revisit information-rich regions. First, we present LRS-GRO, a large-scale benchmark dataset tailored for active perception in UHR RS processing, encompassing 17 question types across global, region, and object levels, annotated via a semi-automatic pipeline. Building on LRS-GRO, we propose ZoomEarth, an adaptive cropping-zooming framework with a novel Region-Guided reward that provides fine-grained guidance. Trained via supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), ZoomEarth achieves state-of-the-art performance on LRS-GRO and, in the zero-shot setting, on three public UHR remote sensing benchmarks. Furthermore, ZoomEarth can be seamlessly integrated with downstream models for tasks such as cloud removal, denoising, segmentation, and image editing through simple tool interfaces, demonstrating strong versatility and extensibility.

ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks

TL;DR

Abstract

ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (22)

Theorems & Definitions (1)