MMEarth-Bench: Global Model Adaptation via Multimodal Test-Time Training

Lucia Gordon; Serge Belongie; Christian Igel; Nico Lang

TL;DR

This work introduces MMEarth-Bench, a globally distributed, multimodal geospatial benchmark comprising five tasks (biomass, soil nitrogen, soil organic carbon, soil pH, and species occurrence) with 12 modalities, designed to probe pretrained models under data sparsity and geographic distribution shift. The benchmark shows that multimodal pretraining generally improves robustness in limited-label regimes, but that geographic generalization remains challenging. To adapt models at test time, the authors propose a model-agnostic multimodal test-time training framework (TTT-MMR) that uses all available modalities as reconstruction targets, with a geographic batching variant (TTT-MMR-Geo) that balances regularization and specialization. Across multiple pretrained architectures, TTT-MMR and TTT-MMR-Geo yield consistent improvements over joint training alone on both the random and geographic test splits, which supports their use as practical, scalable adaptation strategies for global Earth observation (EO) tasks. The work also provides dataset tooling, an Explorer for visualization, and guidance for practitioners applying the method to new tasks, and it highlights both the promise of multimodal Earth observation models and the remaining gaps in their geographic generalization.

Abstract

Recent research in geospatial machine learning has demonstrated that models pretrained with self-supervised learning on Earth observation data can perform well on downstream tasks with limited training data. However, most of the existing geospatial benchmark datasets have few data modalities and poor global representation, limiting the ability to evaluate multimodal pretrained models at global scales. To fill this gap, we introduce MMEarth-Bench, a collection of five new multimodal environmental tasks with 12 modalities, globally distributed data, and both in- and out-of-distribution test splits. We benchmark a diverse set of pretrained models and find that while (multimodal) pretraining tends to improve model robustness in limited data settings, geographic generalization abilities remain poor. In order to facilitate model adaptation to new downstream tasks and geographic domains, we propose a model-agnostic method for test-time training with multimodal reconstruction (TTT-MMR) that uses all the modalities available at test time as auxiliary tasks, regardless of whether a pretrained model accepts them as input. Our method improves model performance on both the random and geographic test splits, and geographic batching leads to a good trade-off between regularization and specialization during TTT. Our dataset, code, and visualization tool are linked from the project page at lgordon99.github.io/mmearth-bench.
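The test-time training loop described above can be illustrated with a toy example. The following is a minimal sketch under strong simplifying assumptions, not the authors' implementation: the encoder, the task head, and the modality decoder are all linear maps (the hypothetical names W, g_head, and V), the reconstruction loss is a mean squared error, and the encoder is reset to its jointly trained weights for every test batch. The source does not specify any of these choices.

```python
import copy


def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def recon_loss_and_grad(W, V, batch_x, batch_m):
    """Mean squared reconstruction error of V(W x) against the modality
    targets m, together with its gradient with respect to the encoder W."""
    H, D = len(W), len(W[0])
    grad = [[0.0] * D for _ in range(H)]
    loss = 0.0
    for x, m in zip(batch_x, batch_m):
        z = matvec(W, x)                      # encoder features
        r = matvec(V, z)                      # reconstructed modalities
        err = [ri - mi for ri, mi in zip(r, m)]
        loss += sum(e * e for e in err)
        # Back-propagate through the frozen decoder: dL/dz = 2 V^T err.
        dz = [2.0 * sum(V[k][j] * err[k] for k in range(len(V)))
              for j in range(H)]
        for j in range(H):
            for i in range(D):
                grad[j][i] += dz[j] * x[i]
    n = len(batch_x)
    return loss / n, [[gv / n for gv in row] for row in grad]


def ttt_predict(W, V, g_head, batch_x, batch_m, steps=10, lr=0.05):
    """Adapt a copy of the encoder on one test batch using only the
    reconstruction loss, then predict with the frozen task head."""
    W_adapt = copy.deepcopy(W)   # assumption: start from trained weights per batch
    for _ in range(steps):
        _, grad = recon_loss_and_grad(W_adapt, V, batch_x, batch_m)
        W_adapt = [[w - lr * gw for w, gw in zip(w_row, g_row)]
                   for w_row, g_row in zip(W_adapt, grad)]
    preds = [sum(gj * zj for gj, zj in zip(g_head, matvec(W_adapt, x)))
             for x in batch_x]
    return preds, W_adapt
```

No task labels are used during adaptation: only the auxiliary modalities available at test time drive the encoder update, which is what allows the procedure to run on unlabeled data from a new region.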

Paper Structure (56 sections, 5 equations, 27 figures, 40 tables)

Introduction
Related work
Low-shot learning
Domain adaptation
Existing benchmark datasets
MMEarth-Bench Dataset
Methodology
Experimental results
Setup
Transfer learning with limited reference data
Geographic generalization challenge
The effect of multimodal input data
Multimodal test-time training performance
Conclusion
Acknowledgments
...and 41 more sections

Figures (27)

Figure 1: Self-supervised multimodal pretraining promises to overcome the grand challenges in Earth observation. Crucial applications have to rely on limited, sparse, and geographically biased training data. Furthermore, the ambiguities inherent to modeling biophysical quantities with remotely sensed data may be resolved by models conditioned on multiple modalities.
Figure 2: Data splits in MMEarth-Bench. Each of the 5 tasks has a geographic test split ("Africa"); the rest of the world is split randomly into training (70%), validation (15%), and random test (15%) sets. While the full training dataset is shown here, we also provide subsets with 50% and 5% of the training data for even lower-shot experiments.
Figure 3: TTT with multimodal reconstruction (TTT-MMR). (1) We adapt a pretrained encoder to a downstream task by jointly training the encoder together with the main task decoder $g$ and task modality decoder $h$. At test time, for each batch: (2) the modality reconstruction losses are used to adapt the encoder iteratively, and (3) the adapted encoder is used to yield improved predictions.
Figure 4: Low-shot in-distribution performance. Finetuning on subsets of the training data. Symbology: $\bullet$=RGB, $\blacksquare$=S2, $\blacktriangle$=multimodal, solid=random init., dashed=pretrained.
Figure 5: Geographic generalization. Performance comparison on random (R) vs. geographic (G) test splits using all training data.
...and 22 more figures
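
The split scheme in the Figure 2 caption can be sketched in a few lines. This is an illustrative sketch only: the sample records, the "continent" key, the function names, and the seeding are hypothetical, and the benchmark ships its own predefined splits, which should be used for any real comparison.

```python
import random


def make_splits(samples, seed=0):
    """Hold out Africa as the geographic test split and divide the rest
    of the world randomly into 70% train, 15% val, and 15% random test.

    `samples` is a list of dicts with a hypothetical 'continent' key.
    """
    geo_test = [s for s in samples if s["continent"] == "Africa"]
    rest = [s for s in samples if s["continent"] != "Africa"]
    rng = random.Random(seed)
    rng.shuffle(rest)
    n_train = round(0.70 * len(rest))
    n_val = round(0.15 * len(rest))
    return {
        "train": rest[:n_train],
        "val": rest[n_train:n_train + n_val],
        "test_random": rest[n_train + n_val:],
        "test_geographic": geo_test,
    }


def subsample(train, fraction, seed=0):
    """Draw a random subset of the training split for low-shot experiments."""
    rng = random.Random(seed)
    k = max(1, int(fraction * len(train)))
    return rng.sample(train, k)
```

For example, `subsample(splits["train"], 0.05)` would give a 5% training subset in this sketch. Whether the benchmark's 50% and 5% subsets are nested within one another is not stated in the caption.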
