AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modalities

Guillaume Astruc, Nicolas Gonthier, Clement Mallet, Loic Landrieu

TL;DR

AnySat tackles the fragmentation of Earth observation data by introducing a multimodal, self-supervised foundation model built on Joint Embedding Predictive Architecture (JEPA) and scale-adaptive patch encoding. Trained on GeoPlex, a curated collection of five multimodal EO datasets spanning 11 sensors and resolutions from 0.2 m to 250 m, AnySat learns modality-agnostic representations without decoders and generalizes to unseen sensor configurations. The approach yields state-of-the-art results across nine downstream tasks on GeoPlex and six external datasets, including land cover mapping, crop classification, tree species identification, change detection, and post-fire/flood segmentation, while maintaining efficiency in training and inference. This work demonstrates strong cross-modal generalization, rapid adaptation to new sensors, and a viable path toward scalable, global environmental monitoring with a single, reusable model.

Abstract

Geospatial models must adapt to the diversity of Earth observation data in terms of resolutions, scales, and modalities. However, existing approaches expect fixed input configurations, which limits their practical applicability. We propose AnySat, a multimodal model based on joint embedding predictive architecture (JEPA) and scale-adaptive spatial encoders, allowing us to train a single model on highly heterogeneous data in a self-supervised manner. To demonstrate the advantages of this unified approach, we compile GeoPlex, a collection of 5 multimodal datasets with varying characteristics and $11$ distinct sensors. We then train a single powerful model on these diverse datasets simultaneously. Once fine-tuned or probed, we reach state-of-the-art results on the test sets of GeoPlex and for 6 external datasets across various environment monitoring tasks: land cover mapping, tree species identification, crop type classification, change detection, climate type classification, and segmentation of flood, burn scar, and deforestation. The code and models are available at https://github.com/gastruc/AnySat.

Paper Structure

This paper contains 55 sections, 6 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Multi-Dataset Training. For the first time, a single model can be pretrained simultaneously on a collection of Earth Observation datasets with heterogeneous resolutions, scales, and modalities. The resulting model can be fine-tuned to achieve state-of-the-art results for a wide variety of data types and tasks.
  • Figure 2: Scale-Adaptive Patch Encoding. We consider a patch $x^m_p$ of resolution $\Delta_m=P/R_m$ pixels. We first split $x^m_p$ into sub-patches of size $\delta_m$ pixels, which are mapped by a modality-specific projector $\phi^\text{proj}_m$ to an $E$-dimensional embedding. Then, a shared spatial transformer module $\phi^\text{trans}$ combines all sub-patches into a vector of size $E$. As the sub-patch size $\delta_m$ is fixed, the patch size $\Delta_m$ only influences the number of input tokens to $\phi^\text{trans}$, allowing us to use the same network for different resolutions.
  • Figure 3: Architecture of AnySat. We begin each iteration by randomly selecting a dataset among GeoPlex and sampling a tile. Each available modality is divided into spatially aligned patches of size $P$. The student network's patch encoder $\phi^\text{patch}_\mathcal{S}$ embeds each patch and we apply a contrastive loss to encourage spatial consistency across modalities. We then apply dropping and masking: some patches have all modalities removed (dropping), while others have only random modalities removed (masking). The remaining patches are merged in the modality combiner $\phi^\text{comb}_\mathcal{S}$ to form multimodal representations $f^\star_\mathcal{S}$ for the non-dropped patches. The predictor $\phi^\text{pred}_\mathcal{S}$ then reconstructs the embeddings of the dropped patches. Finally, the student network's output is compared to the teacher's, whose weights are an Exponential Moving Average (EMA) of the student's weights and which processes the complete set of patches without masking or dropping.
  • Figure 4: Datasets Considered. GeoPlex is composed of $5$ diverse datasets spanning the entire world, with a higher concentration in Europe and the US, where open data are more abundant. We also consider external evaluation datasets with a more diverse spread.
  • Figure 5: Quantitative Evaluation. We evaluate AnySat across 9 open-access datasets and for four tasks: multilabel classification (classif), semantic segmentation (semseg), pixel-wise change detection (chgdet), and pixel-wise regression (regression). For clarity, we only visualize the four best performances per dataset; see the Appendix for full results. We report the number of trainable parameters for probing evaluations.
  • ...and 2 more figures
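
The token-count argument in the Figure 2 caption can be made concrete with a small sketch. It is illustrative only and is not taken from the AnySat code base. It assumes that $P$ is a ground footprint in meters and $R_m$ a ground sampling distance in meters per pixel, so that $\Delta_m = P/R_m$ is the patch side in pixels, and that patches and sub-patches are square. The function names and the numeric values (a 40 m patch, sub-patch sizes of 10 and 4 pixels) are hypothetical.

```python
def patch_side_pixels(patch_size_m: float, resolution_m: float) -> int:
    """Side length Delta_m = P / R_m of a patch, in pixels.

    Assumption: P is in meters and R_m in meters per pixel.
    """
    side = patch_size_m / resolution_m
    if abs(side - round(side)) > 1e-9:
        raise ValueError("patch size must be a whole number of pixels")
    return round(side)


def num_subpatch_tokens(patch_size_m: float, resolution_m: float, subpatch_px: int) -> int:
    """Number of sub-patch tokens fed to the shared spatial transformer.

    With a fixed sub-patch size delta_m, a square patch of side Delta_m
    yields (Delta_m / delta_m) ** 2 tokens.
    """
    side = patch_side_pixels(patch_size_m, resolution_m)
    if side % subpatch_px != 0:
        raise ValueError("sub-patch size must divide the patch side")
    return (side // subpatch_px) ** 2


# Hypothetical configurations covering the same 40 m footprint.
print(num_subpatch_tokens(40, 0.2, 10))  # 200 px side, 10 px sub-patches -> 400
print(num_subpatch_tokens(40, 10, 4))    # 4 px side, 4 px sub-patch -> 1
```

Under these assumptions, the same footprint produces 400 tokens for a 0.2 m sensor and a single token for a 10 m sensor. Only the sequence length changes, which is why one shared transformer can serve every resolution.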