How To Embed Matters: Evaluation of EO Embedding Design Choices

Luis Gilch; Isabelle Wittmann; Maximilian Nitsche; Johannes Jakubik; Arne Ewald; Thomas Brunschwiler

TL;DR

This work presents a systematic analysis of embedding design in GeoFM-based EO workflows, and demonstrates the usability of GeoFM embeddings by aggregating them into fixed-size representations more than 500x smaller than the raw input data.

Abstract

Earth observation (EO) missions produce petabytes of multispectral imagery, increasingly analyzed using large Geospatial Foundation Models (GeoFMs). Alongside end-to-end adaptation, workflows make growing use of intermediate representations as task-agnostic embeddings, enabling models to compute representations once and reuse them across downstream tasks. Consequently, when GeoFMs act as feature extractors, decisions about how representations are obtained, aggregated, and combined affect downstream performance and pipeline scalability. Understanding these trade-offs is essential for scalable embedding-based EO workflows, where compact embeddings can replace raw data while remaining broadly useful. We present a systematic analysis of embedding design in GeoFM-based EO workflows. Leveraging NeuCo-Bench, we study how backbone architecture, pretraining strategy, representation depth, spatial aggregation, and representation combination influence EO task performance. We demonstrate the usability of GeoFM embeddings by aggregating them into fixed-size representations more than 500x smaller than the raw input data. Across models, we find consistent trends: transformer backbones with mean pooling provide strong default embeddings, intermediate ResNet layers can outperform final layers, self-supervised objectives exhibit task-specific strengths, and combining embeddings from different objectives often improves robustness.
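
The spatial aggregation and combination choices named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The token layout (CLS token in row 0, followed by patch tokens), the shape of 196 patches by 384 dimensions, and the helper name `aggregate_tokens` are all assumptions made for illustration, and the tokens are random placeholders rather than model outputs.

```python
import numpy as np


def aggregate_tokens(tokens: np.ndarray, method: str = "mean") -> np.ndarray:
    """Collapse a (num_tokens, dim) token matrix into one (dim,) vector.

    Assumption: row 0 is the CLS token and the remaining rows are patch
    tokens. The source does not specify the token layout.
    """
    cls_token, patches = tokens[0], tokens[1:]
    if method == "mean":
        return patches.mean(axis=0)
    if method == "max":
        return patches.max(axis=0)
    if method == "min":
        return patches.min(axis=0)
    if method == "cls":
        return cls_token
    raise ValueError(f"unknown aggregation method: {method}")


# Hypothetical token matrix: 1 CLS token + 196 patch tokens, 384 dimensions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(1 + 196, 384))

mean_emb = aggregate_tokens(tokens, "mean")
cls_emb = aggregate_tokens(tokens, "cls")

# Combining representations by concatenation doubles the embedding size.
concat_emb = np.concatenate([mean_emb, cls_emb])
```

Whatever the input resolution, each aggregation method yields one fixed-size vector per sample, which is what makes the embedding reusable across downstream tasks.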

Paper Structure (31 sections, 4 equations, 12 figures, 5 tables)

Introduction
Related Work
Geospatial Foundation Models.
Embedding-Centric Workflows in EO.
Benchmarks for GeoFMs and Embeddings.
Methodology
Evaluation Protocol
NeuCo-Bench Setup.
Cross-Validation and Metrics.
Benchmark Dataset.
GeoFM Backbones and Pretraining Strategies
Temporal and Spatial Aggregation
Concatenation Experiments
Intermediate-Layer Analysis
Experiments and Results
...and 16 more sections

Figures (12)

Figure 1: Per-task embedding performance across design choices. Distribution of regression performance across GeoFM backbones, self-supervised pretraining strategies, spatial aggregation methods, intermediate layers, and representation combinations. Performance is measured using mean $R^2$ (left), reflecting predictive accuracy, and the NeuCo Quality Score (right), which accounts for variability to reflect robustness. While most methods achieve similar peak accuracy for more saturated tasks, robustness varies, leading to clearer differentiation in the Quality Score; concatenated representations often rank among the most robust configurations in these cases. Boxplots summarize the distribution over all evaluated embedding variants; whiskers denote $1.5\times$ IQR and outliers are omitted. Markers indicate the single best-performing embedding configuration among all evaluated variants for the respective task.
Figure 2: Per-task Q-score comparison of ResNet-50 (left) and ViT-Small (right) FMs. We use final-layer embeddings with mean pooling; negative scores are clipped to zero. ResNet models score high on semantic/land-cover tasks but achieve low scores elsewhere. ViT models are more consistent across tasks and achieve meaningful performance beyond land cover: TerraMind is the most consistent overall, DINO is strong on land cover but weaker on other tasks, and FGMAE excels on the cloud-cover and biomass tasks. The radial axis is centered at $0$, and the maximal radius is set globally with a fixed buffer. A corresponding per-task $R^2$ plot is provided in the supplementary material.
Figure 3: Per-task Q-score comparison of spatial aggregation methods for ResNet-50 (left) and ViT-Small (right). We use final-layer embeddings with mean, min, or max pooling (or the CLS token for ViT) and average scores across models; negative scores are clipped to zero. For ResNet, mean pooling performs best across tasks, with max pooling outperforming min pooling. For ViT, mean pooling again performs best, with CLS comparable on most tasks, while min and max pooling are similar to each other but weaker, especially on continuous biophysical targets. A corresponding per-task $R^2$ plot is provided in the supplementary material.
Figure 4: Per-task and overall $\Delta R^2$ (left) and $\Delta$ Q-score (right) for embedding concatenation. Top: Intra-method concatenation (Mean + CLS within the same ViT-Small SSL4EO model). Bottom: Inter-method concatenation (Mean + Mean across different SSL objectives). For each task, the baseline is the stronger individual embedding, and we report $\Delta = \text{score}_{\text{concat}} - \text{score}_{\text{baseline}}$ (zero indicates no change). We additionally report the overall gain relative to the stronger overall baseline. Intra-method (Mean+CLS) concatenation yields only modest improvements (typically $<0.04$ in $R^2$ and $<1$ Q-score point), indicating substantial redundancy between token aggregation strategies. In contrast, inter-method (Mean+Mean) concatenation produces larger overall gains and consistent per-task improvements, reflecting complementary strengths across SSL objectives. While per-task deltas remain moderate, overall gains demonstrate that diversity in pretraining objectives contributes more to complementarity than alternative token aggregation within a single model.
Figure 5: Layer-wise task-averaged performance ($R^2$, left; Q-score, right). Top: ViT-Small; bottom: ResNet-50 (SSL4EO). Representations are extracted from each layer (12 transformer blocks; 5 ResNet stages with output dimensions 64, 256, 512, 1024, 2048); negative task values are clipped before averaging. ViT performance increases in early layers and then saturates, whereas ResNet shows an inverted-U pattern, peaking at intermediate stages and degrading at the final layer. A resized last-layer reference is included for ResNet.
...and 7 more figures
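
The score handling described in the captions of Figures 4 and 5 (a per-task $\Delta$ against the stronger individual embedding, and clipping negative task values before averaging) can be sketched as follows. The function names and all numeric values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def concat_gain(score_a: float, score_b: float, score_concat: float) -> float:
    """Delta = score_concat - score_baseline, where the baseline is the
    stronger of the two individual embeddings (zero means no change)."""
    baseline = max(score_a, score_b)
    return score_concat - baseline


def clipped_mean(task_scores: list[float]) -> float:
    """Average task scores after clipping negative values to zero."""
    return sum(max(s, 0.0) for s in task_scores) / len(task_scores)


# Hypothetical R^2 values for one task: two single embeddings and their concatenation.
delta = concat_gain(score_a=0.60, score_b=0.70, score_concat=0.73)

# Hypothetical per-task scores for one layer; the negative score counts as zero.
layer_average = clipped_mean([-0.2, 0.4, 0.8])
```

Measuring the gain against the stronger baseline rather than the average of the two is the more conservative choice: a positive delta means the concatenation beats the better of the two single embeddings on that task.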
