TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

Yan Shu; Bin Ren; Zhitong Xiong; Xiao Xiang Zhu; Begüm Demir; Nicu Sebe; Paolo Rota

Abstract

Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.

Paper Structure (36 sections, 6 equations, 14 figures, 14 tables, 1 algorithm)

Introduction
Related Works
Method
Overview
TerraScope Framework
Terra-CoT Dataset
TerraScope-Bench
Experiments
Main Results
Ablation Studies
Qualitative Results
Conclusion
Limitations and Future Work
Comparison to Concurrent Works
Details of TerraScope
...and 21 more sections

Figures (14)

Figure 1: (a-1): Most Vision Language Models (VLMs) answer directly without reasoning and output wrong results. (a-2): Some solutions reason via textual Chain-of-Thought (CoT). (a-3): Our TerraScope pairs pixel-level grounding masks with textual input, forming an interleaved CoT. (b): Our Terra-CoT 1M dataset. (c): Our TerraScope benchmark.
Figure 2: Overview of TerraScope. TerraScope generates textual reasoning tokens and segmentation masks in an interleaved manner, where masked visual features are injected at each reasoning step to ensure faithful pixel-grounded reasoning. TerraScope supports multi-modal and multi-temporal reasoning across EO data.
Figure 3: Terra-CoT curation pipeline. First, we generate Cap-CoT using ground truth masks and class labels to train an initial annotation model. Second, we use the trained model to annotate unlabeled data with pixel-accurate masks and captions. Third, based on the synthetic annotations, we apply hierarchical data synthesis to generate diverse reasoning questions with chain-of-thought traces at two levels: (L1) basic spatial grounding and (L2) complex multi-step reasoning including spatial and semantic tasks.
Figure 4: Examples of TerraScope-Bench.
Figure 5: Grounding IoU performance of different models.
...and 9 more figures
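
Figure 5 reports grounding IoU, and the abstract notes that TerraScope-Bench scores mask quality alongside answer accuracy. As a minimal sketch, assuming the standard intersection-over-union between a predicted and a reference binary mask (the benchmark's exact evaluation protocol is not stated in this summary, and the function name `mask_iou` is hypothetical), the quantity could be computed as follows:

```python
def mask_iou(pred, target):
    """Intersection-over-union of two binary masks.

    Both masks are equally sized 2D lists of 0/1 values.
    This is the generic IoU definition, not necessarily the
    exact metric used by TerraScope-Bench.
    """
    same_rows = len(pred) == len(target)
    if not same_rows or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p, t = bool(p), bool(t)
            intersection += p and t  # pixel set in both masks
            union += p or t          # pixel set in at least one mask

    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered in total.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 0.5
```

Treating two empty masks as a perfect match is one common choice; an evaluation could equally skip such samples, which would change the averaged score.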
