OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation

Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon, Hadrien Sablon, Ryan Park, Jacob Morrison, Alexandra Buraczynski, Karen Farley, Joshua Hansen, Andrew Howe, Patrick Alan Johnson, Mark Otterlee, Ted Schmitt, Hunter Pitelka, Stephen Daspit, Rachel Ratner, Christopher Wilhelm, Sebastian Wood, Mike Jacobi, Hannah Kerner, Evan Shelhamer, Ali Farhadi, Ranjay Krishna, Patrick Beukema

TL;DR

OlmoEarth tackles the challenge of training robust, multimodal, spatio-temporal foundation models for Earth observation by introducing a stable latent-space training regime. The method combines Latent MIM Lite with modality-aware masking and a dual loss: patch-discrimination in latent space and an instance-contrastive objective over pooled embeddings, yielding strong performance across 18 benchmarks and 19 partner tasks. The authors further provide an open, end-to-end OlmoEarth Platform that supports data curation, labeling, training, and inference to empower NGOs and humanitarian efforts, alongside transparent reporting of environmental impact. The work demonstrates substantial embedding- and fine-tuning-stage gains, emphasizes open accessibility, and outlines a path toward broader adoption of advanced remote-sensing models in mission-driven contexts.

Abstract

Earth observation data presents a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present OlmoEarth: a multimodal, spatio-temporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss all designed for the Earth observation domain. OlmoEarth achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners. When evaluating embeddings OlmoEarth achieves the best performance on 15 out of 24 tasks, and with full fine-tuning it is the best on 19 of 29 tasks. We deploy OlmoEarth as the backbone of an end-to-end platform for data collection, labeling, training, and inference of Earth observation models. The OlmoEarth Platform puts frontier foundation models and powerful data management tools into the hands of non-profits and NGOs working to solve the world's biggest problems. OlmoEarth source code, training data, and pre-trained weights are available at https://github.com/allenai/olmoearth_pretrain.

Paper Structure

This paper contains 34 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: OlmoEarth defines a Pareto optimum of performance vs. computational efficiency averaged across 13 embedding tasks (measured by kNN and linear probing). The chart shows average multiply-accumulate operations to encode one example across all tasks (input size varies by task). See Table \ref{tab:knnlp} for full results.
  • Figure 2: Global distribution of OlmoEarth pretraining data. We sample 285,288 locations based on OpenStreetMap categories.
  • Figure 3: We train OlmoEarth with a combination of satellite observations and high-quality maps. After tokenizing these inputs, we: (1) apply a modality-aware masking strategy to define which tokens are inputs vs. targets, (2) pass the target tokens through fixed random projections to construct targets, (3) pass the input tokens through our learned encoders, and then (4) through a decoder which predicts the target tokens and (5) apply a modality-aware patch discrimination loss between the predicted and target tokens. Steps 1-5 are applied twice on the same data to then (6) apply an instance contrastive loss over the aggregated tokens per instance.
  • Figure 4: Results of a fine-tuned ecosystem classification model in the OlmoEarth Platform. Users can label data, fine-tune models, and run inference to generate maps all in the OlmoEarth Platform.
  • Figure 5: An example instance from the m_cashew_plant dataset: note the coarse, polygonal labels.
  • ...and 2 more figures
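
The training steps listed in the Figure 3 caption (fixed random projections for targets, a patch discrimination loss, and an instance contrastive loss over two passes) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes both losses take an InfoNCE-style form with cosine similarity, it omits the modality-aware parts of the masking and the loss, and the function names, the temperature `tau`, and `contrastive_weight` are all hypothetical names introduced here for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def info_nce(queries, keys, tau):
    """Cross-entropy in which row i of `queries` must pick row i of `keys`."""
    logits = l2_normalize(queries) @ l2_normalize(keys).T / tau
    return float(-np.mean(np.diag(log_softmax(logits))))


def random_projection_targets(target_tokens, projection):
    """Step 2: targets come from a fixed, never-trained random projection."""
    return target_tokens @ projection


def patch_discrimination_loss(predicted, targets, tau=0.1):
    """Step 5 (simplified): each predicted token should match its own target
    token rather than any other target token. Modality awareness is omitted."""
    return info_nce(predicted, targets, tau)


def instance_contrastive_loss(pooled_a, pooled_b, tau=0.1):
    """Step 6 (simplified): pooled embeddings of the same instance from the two
    passes are positives; other instances in the batch act as negatives."""
    return 0.5 * (info_nce(pooled_a, pooled_b, tau) + info_nce(pooled_b, pooled_a, tau))


def total_loss(predicted, targets, pooled_a, pooled_b, contrastive_weight=0.1):
    """Weighted sum of both terms; `contrastive_weight` is a hypothetical knob."""
    patch_term = patch_discrimination_loss(predicted, targets)
    instance_term = instance_contrastive_loss(pooled_a, pooled_b)
    return patch_term + contrastive_weight * instance_term
```

In this sketch, `predicted` and `targets` are arrays of shape `(num_target_tokens, dim)` and `pooled_a` and `pooled_b` are arrays of shape `(batch, dim)`, one per pass over the same data. Because the projection is fixed rather than learned, the targets cannot drift during training, which is one plausible reading of the "stable" latent-space regime the summary describes.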