SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model

Kaiyu Li, Zepeng Xin, Li Pang, Chao Pang, Yupeng Deng, Jing Yao, Guisong Xia, Deyu Meng, Zhi Wang, Xiangyong Cao

TL;DR

The paper defines geospatial pixel reasoning as a task where models infer segmentation masks for implicit queries in remote sensing. It introduces EarthReason, a large-scale dataset with 5,434 image-mask pairs across 28 categories and 30k+ implicit QA pairs, plus empty-target cases and multi-scale imagery to test generalization. It proposes SegEarth-R1, a language-guided segmentation model that combines a hierarchical visual encoder, an LLM for instruction parsing, and a description-embedding-based mask generator tailored for spatial correlation, with aggressive token compression and a description projector. In the reported experiments, SegEarth-R1 achieves state-of-the-art results on geospatial pixel reasoning and referring segmentation, and the authors plan to release the data and code to foster further research.

Abstract

Remote sensing has become critical for understanding environmental dynamics, urban planning, and disaster management. However, traditional remote sensing workflows often rely on explicit segmentation or detection methods, which struggle to handle complex, implicit queries that require reasoning over spatial context, domain knowledge, and implicit user intent. Motivated by this, we introduce a new task, i.e., geospatial pixel reasoning, which allows implicit querying and reasoning and generates the mask of the target region. To advance this task, we construct and release the first large-scale benchmark dataset called EarthReason, which comprises 5,434 manually annotated image masks with over 30,000 implicit question-answer pairs. Moreover, we propose SegEarth-R1, a simple yet effective language-guided segmentation baseline that integrates a hierarchical visual encoder, a large language model (LLM) for instruction parsing, and a tailored mask generator for spatial correlation. The design of SegEarth-R1 incorporates domain-specific adaptations, including aggressive visual token compression to handle ultra-high-resolution remote sensing images, a description projection module to fuse language and multi-scale features, and a streamlined mask prediction pipeline that directly queries description embeddings. Extensive experiments demonstrate that SegEarth-R1 achieves state-of-the-art performance on both reasoning and referring segmentation tasks, significantly outperforming traditional and LLM-based segmentation methods. Our data and code will be released at https://github.com/earth-insights/SegEarth-R1.
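The data flow described in the abstract (compress visual tokens, obtain description embeddings, project them into a query, correlate the query with pixel features) can be illustrated with a small shape-level sketch. This is not the authors' implementation: the function names, the mean-pooling compression, the single linear projection, and all tensor sizes are assumptions chosen for illustration, and random arrays stand in for the visual encoder and LLM outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def compress_tokens(feat, stride):
    """Average-pool an (H, W, C) feature map by `stride`, then flatten to tokens.
    Mean pooling is an assumed stand-in for the paper's token compression."""
    h, w, c = feat.shape
    assert h % stride == 0 and w % stride == 0
    pooled = feat.reshape(h // stride, stride, w // stride, stride, c).mean(axis=(1, 3))
    return pooled.reshape(-1, c)

def project_description(desc_emb, weight):
    """Pool (T, D) description embeddings into one (C,) query vector.
    A single linear map is a hypothetical simplification of a description projector."""
    return desc_emb.mean(axis=0) @ weight

def predict_mask(query, pixel_feat, threshold=0.5):
    """Correlate the query with every pixel feature and threshold the sigmoid."""
    logits = pixel_feat @ query                # (H, W)
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs > threshold

# Toy sizes (assumed): 32x32 feature map, 16 channels, 6 description tokens of width 24.
H, W, C, T, D = 32, 32, 16, 6, 24
pixel_feat = rng.standard_normal((H, W, C))            # stand-in for encoder features
visual_tokens = compress_tokens(pixel_feat, stride=4)  # tokens that would be fed to the LLM
desc_emb = rng.standard_normal((T, D))                 # stand-in for LLM description embeddings
w_proj = rng.standard_normal((D, C)) * 0.1             # hypothetical projection weights
query = project_description(desc_emb, w_proj)
mask = predict_mask(query, pixel_feat)

print(visual_tokens.shape, query.shape, mask.shape)
```

With a stride of 4, the 1,024 spatial positions are reduced to 64 tokens before they would reach the LLM, while the mask is still predicted at the full feature-map resolution. An all-False mask is how this sketch would represent an empty-target case.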

Paper Structure

This paper contains 27 sections, 1 equation, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Comparison of semantic segmentation, referring segmentation and geospatial pixel reasoning. (left) Samples from the LoveDA [wang2021loveda] and RRSIS-D [liu2024rotated] datasets. (right) Samples from the EarthReason dataset. Previous tasks are limited by fixed taxonomies and explicit instructions, while geospatial pixel reasoning supports complex implicit instructions and requires the reasoning capability of the model.
  • Figure 2: Overview of the proposed SegEarth-R1 architecture. Given an image $X_v$ and a text description $X_q$, a hierarchical visual encoder and a proposed connector are used to extract and compress visual tokens. Then, the visual tokens and description embeddings are fed into an LLM for instruction interpretation and semantic correlation. Finally, description embeddings are directly mapped to the query vector and used for spatial correlation and segmentation mask generation.
  • Figure 3: Redundancy analysis of remote sensing datasets and natural images; the former exhibit higher redundancy.
  • Figure 4: $D$-Projector.
  • Figure 5: Qualitative results of SegEarth-R1 on EarthReason. More results can be found in the appendix.
  • ...and 6 more figures