TerraFlow: Multimodal, Multitemporal Representation Learning for Earth Observation

Nazar Puriy; Johannes Jakubik; Benedikt Blumenstiel; Konrad Schindler

Abstract

We propose TerraFlow, a novel approach to multimodal, multitemporal learning for Earth observation. TerraFlow builds on temporal training objectives that enable sequence-aware learning across space, time, and modality, while remaining robust to the variable-length inputs commonly encountered in real-world Earth observation data. In our experiments, TerraFlow outperforms state-of-the-art foundation models for Earth observation on all temporal tasks of the GEO-Bench-2 benchmark. We additionally show that TerraFlow takes initial steps towards deep-learning-based risk map prediction for natural disasters, a task on which other state-of-the-art foundation models frequently collapse. TerraFlow outperforms state-of-the-art foundation models by up to 50% in F1 score and 24% in Brier score.

Paper Structure (20 sections, 3 equations, 10 figures, 10 tables)

Introduction
Background
Methodology
Experimental Setting
Experiments
Spatial Risk Map Prediction
Kuro Siwo.
ImpactMesh.
Temporal Analysis.
Discussion and Concluding Remarks
Masking Strategies
Experimental Setting: Downstream Applications
Additional Results
Ablation Studies
Preliminary experiments
...and 5 more sections

Figures (10)

Figure 1: Image encoders can be applied to temporal tasks using a late fusion approach (left) while TerraFlow uses temporal attention for early fusion (middle). We benchmark TerraFlow on four temporal GEO-Bench-2 datasets and challenging disaster risk maps from ImpactMesh and Kuro Siwo (right).
Figure 2: TerraFlow pretraining. Input and target patches are sampled from multiple modalities and timestamps, represented by either raw pixel values or tokens. TerraFlow uses a standard transformer encoder-decoder architecture, trained with cross entropy loss. The attention blocks apply RoPE to encode the relative temporal offset between timestamps in the queries and keys.
Figure 3: Qualitative comparison on the Kuro Siwo flood risk prediction task. The first columns display two pre-event Sentinel-1 images with VV-VH-VV/VH pseudo coloring, along with the DEM. TerraFlow predicts plausible risk maps and accounts for permanent water bodies, whereas image-level models such as TerraMind fail to predict risk accurately.
Figure 4: Qualitative comparison on the ImpactMesh-Fire risk prediction task. The inputs include two pre-event Sentinel-2 images, two Sentinel-1 images with VV-VH-VV/VH pseudo coloring, and the DEM. No model, including TerraFlow, learns meaningful patterns for fire prediction, which may stem from the unpredictable human influence on many fire events.
Figure 5: Downstream task performance on test sets as a function of the number of timesteps per sample, uniformly sampled from the available set.
...and 5 more figures
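
The Figure 2 caption states that the attention blocks apply RoPE to encode the relative temporal offset between timestamps in the queries and keys. A minimal NumPy sketch of that general idea follows. It is not the authors' implementation: the function name `temporal_rope`, the frequency base of 10000, the interleaved pairing of feature channels, and the use of plain numeric timestamps are all assumptions made for illustration.

```python
import numpy as np


def temporal_rope(x, t, base=10000.0):
    """Rotate channel pairs of x by angles proportional to the timestamp t.

    x: float array of shape (n_tokens, dim), dim must be even.
    t: array of shape (n_tokens,), one timestamp per token (unit is an assumption).
    base: frequency base (assumed value, borrowed from common RoPE setups).
    """
    n_tokens, dim = x.shape
    assert dim % 2 == 0, "RoPE rotates channels in pairs"
    # One rotation frequency per channel pair.
    freqs = base ** (-np.arange(0, dim, 2) / dim)        # (dim/2,)
    angles = np.asarray(t, dtype=float)[:, None] * freqs  # (n_tokens, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=float)
    # Standard 2D rotation applied to each (even, odd) channel pair.
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


# Two query/key pairs with different absolute timestamps but the same offset.
rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))
score_a = temporal_rope(q, [10.0]) @ temporal_rope(k, [3.0]).T
score_b = temporal_rope(q, [110.0]) @ temporal_rope(k, [103.0]).T
```

Because each channel pair is rotated by an angle linear in the timestamp, the query-key dot product depends only on the difference between the two timestamps, so `score_a` and `score_b` agree up to floating-point error. This is the property that lets attention reason about temporal offsets rather than absolute acquisition dates.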
