Global and Dense Embeddings of Earth: Major TOM Floating in the Latent Space

Mikolaj Czerkawski, Marcin Kluczek, Jędrzej S. Bojanowski

TL;DR

This work tackles the challenge of scalable, semantic representations for global Earth observation data by extending the Major TOM project to produce dense, open embeddings. It introduces a standardized embedding expansion pipeline that fragments Major TOM grid cells, preprocesses the data, applies four pretrained models (SSL4EO-S2, SSL4EO-S1RTC, SigLIP, DINOv2) to generate embeddings, and stores the results with rich metadata in GeoParquet. The release includes four global embedding datasets covering more than 3.5 million images and 9.368 trillion pixels, enabling reproducible evaluation and fast downstream tasks such as land-use monitoring. The work also provides software tooling for embedding generation, querying, and evaluation, laying the groundwork for scalable, cross-model analysis across time and space in Earth observation.

Abstract

With the ever-increasing volumes of Earth observation data in the archives of large programmes such as Copernicus, there is a growing need for efficient vector representations of the underlying raw data. Extracting feature representations from pretrained deep neural networks is a powerful approach that can provide semantic abstractions of the input data. However, how this should be done for imagery archives containing geospatial data has not yet been defined. In this work, an extension is proposed to an existing community project, Major TOM, focused on the provision and standardization of open and free AI-ready datasets for Earth observation. Furthermore, four global and dense embedding datasets are released openly and for free along with the publication of this manuscript, resulting in the most comprehensive global open dataset of geospatial visual embeddings in terms of the Earth's surface covered.

Paper Structure

This paper contains 18 sections, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: The pipeline for building Major TOM embedding expansions according to the proposed standard. It begins with grid cell fragmenting, followed by image preprocessing, embedding, and packing into the GeoParquet archive format.
  • Figure 2: Fragmenting function for SigLIP (fragments of 384 pixels) for 2 independent Major TOM grid cells plotted next to each other. Note that these cells (green and red) are fragmented and processed independently, and are plotted here together for visualisation.
  • Figure 3: Individual fragments for 2 independent Major TOM grid cells plotted next to each other.
  • Figure 4: Principal component analysis with 3 components mapped to RGB channels (a larger format of the same images is available in the appendix).
  • Figure 5: Principal component analysis with 3 components mapped to RGB channels for SigLIP-SO400M
  • ...and 3 more figures
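
The fragmenting step named in the captions of Figures 2 and 3 can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 1068-pixel cell size, the non-overlapping stride, and the choice to shift the last window flush with the cell edge are all assumptions made for illustration. Only the 384-pixel fragment size for SigLIP comes from the caption of Figure 2.

```python
import numpy as np


def fragment_offsets(length: int, size: int, stride: int) -> list[int]:
    """Start offsets of windows of `size` pixels that cover [0, length)."""
    if size > length:
        raise ValueError("fragment size exceeds cell size")
    offsets = list(range(0, length - size + 1, stride))
    # Assumption: add a final window flush with the edge so that no
    # pixels at the border of the cell are left uncovered.
    if offsets[-1] != length - size:
        offsets.append(length - size)
    return offsets


def fragment_cell(cell: np.ndarray, size: int = 384, stride: int = 384):
    """Split one (H, W, C) grid cell into square fragments.

    Returns a list of ((top, left), fragment) pairs. Each cell is
    fragmented independently of its neighbours.
    """
    height, width = cell.shape[:2]
    fragments = []
    for top in fragment_offsets(height, size, stride):
        for left in fragment_offsets(width, size, stride):
            fragments.append(((top, left), cell[top:top + size, left:left + size]))
    return fragments


# Hypothetical 1068 x 1068 RGB cell (cell size is an assumption here).
cell = np.zeros((1068, 1068, 3), dtype=np.uint8)
fragments = fragment_cell(cell)
```

Under these assumptions each axis yields the offsets 0, 384, and 684, so the last window overlaps its neighbour rather than dropping the border pixels. Keeping the offsets alongside each fragment is what would allow an embedding to be traced back to its footprint within the grid cell when the results are stored with metadata.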