DescribeEarth: Describe Anything for Remote Sensing Images

Kaiyu Li, Zixuan Jiang, Xiangyong Cao, Jiayu Wang, Yuchen Xiao, Deyu Meng, Zhi Wang

TL;DR

Geo-DLC targets object-level, fine-grained captioning for remote sensing (RS) imagery. The authors introduce DE-Dataset and DE-Benchmark to enable scalable, domain-aware training and evaluation, and propose DescribeEarth, a scale-aware, domain-guided multi-modal large language model (MLLM) that fuses RS priors with high-resolution local features while keeping the vision backbone frozen. On DE-Benchmark, the approach consistently outperforms state-of-the-art general MLLMs, with notable gains in contextual and appearance-based descriptions and robust performance in complex and out-of-distribution scenarios. The work provides a practical framework and resources for dense, geospatially grounded descriptions, supporting automated, interpretable RS analytics and advancing vision-language modeling in the Earth observation domain.

Abstract

Automated textual description of remote sensing images is crucial for unlocking their full potential in diverse applications, from environmental monitoring to urban planning and disaster management. However, existing studies in remote sensing image captioning primarily focus on the image level and lack object-level, fine-grained interpretation, which prevents full use of the rich semantic and structural information contained in remote sensing images. To address this limitation, we propose Geo-DLC, a novel task of object-level fine-grained image captioning for remote sensing. To support this task, we construct DE-Dataset, a large-scale dataset containing 25 categories and 261,806 annotated instances with detailed descriptions of object attributes, relationships, and contexts. Furthermore, we introduce DE-Benchmark, an LLM-assisted, question-answering-based evaluation suite designed to systematically measure model capabilities on the Geo-DLC task. We also present DescribeEarth, a Multi-modal Large Language Model (MLLM) architecture explicitly designed for Geo-DLC, which integrates a scale-adaptive focal strategy and a domain-guided fusion module leveraging remote sensing vision-language model features to encode high-resolution details and remote sensing category priors while maintaining global context. Our DescribeEarth model consistently outperforms state-of-the-art general MLLMs on DE-Benchmark, demonstrating superior factual accuracy, descriptive richness, and grammatical soundness, particularly in capturing intrinsic object features and surrounding environmental attributes across simple, complex, and even out-of-distribution remote sensing scenarios. All data, code, and weights are released at https://github.com/earth-insights/DescribeEarth.

Paper Structure

This paper contains 28 sections, 1 equation, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Some methods proposed to resolve remote sensing image description related tasks. (a) and (b) are typical image captioning frameworks [lu2017exploring, zhang2021shiprsimagenet]. (c) is a recent object description model [jiang2025eaglevision], but it can only generate template-based descriptions across a limited number of categories and does not support interactive descriptions. (d) is our DescribeEarth model, which can generate detailed, open-ended and localized descriptions based on off-the-shelf detectors or user interactions.
  • Figure 2: Comparison of DescribeEarth, DAM [xiao2025describe], and other general MLLMs [hurst2024gpt, comanici2025gemini] in describing remote sensing images. Green rendering represents intrinsic features of the target, yellow denotes spatial features, blue indicates contextual features, and red signifies incorrect descriptions. DescribeEarth delivers the most detailed and accurate object descriptions.
  • Figure 3: The production pipeline of our DE-Dataset. (a) is the tiling strategy used in dataset preprocessing. (b) is the generation process of localized description.
  • Figure 4: The distribution of instance categories and their corresponding quantities ($\log_{10}$ scale) in DE-Dataset.
  • Figure 5: Examples of different instance types in DE-Benchmark.
  • ...and 3 more figures
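
The Figure 3 caption mentions a tiling strategy in dataset preprocessing but does not spell it out here. As a rough illustration only, the sketch below shows one common way to cut a large remote sensing image into overlapping fixed-size tiles. The function names (`tile_origins`, `tile_boxes`), the tile size, and the overlap are all assumptions made for this example and are not taken from the paper; the authors' actual tiling may differ.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), pixel coordinates


def tile_origins(length: int, tile: int, overlap: int) -> List[int]:
    """Start offsets of tiles that cover [0, length) along one axis.

    Hypothetical helper: consecutive tiles advance by `tile - overlap`,
    and the last tile is shifted back so it ends flush with the image edge.
    """
    if tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("need tile > 0 and 0 <= overlap < tile")
    if length <= tile:
        return [0]  # the image is smaller than one tile along this axis
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # final tile aligned to the far edge
    return starts


def tile_boxes(width: int, height: int, tile: int, overlap: int) -> List[Box]:
    """Row-major list of tile boxes covering a width x height image."""
    boxes = []
    for y0 in tile_origins(height, tile, overlap):
        for x0 in tile_origins(width, tile, overlap):
            # Clamp to the image bounds for images smaller than one tile.
            boxes.append((x0, y0, min(x0 + tile, width), min(y0 + tile, height)))
    return boxes


if __name__ == "__main__":
    # Assumed values for illustration: a 1000x600 image, 512-pixel tiles,
    # 128 pixels of overlap between neighbouring tiles.
    for box in tile_boxes(1000, 600, tile=512, overlap=128):
        print(box)
```

Overlap is a typical design choice in this kind of preprocessing because an object that straddles a tile boundary is then fully visible in at least one tile, provided it is smaller than the overlap.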