RAMEN: Resolution-Adjustable Multimodal Encoder for Earth Observation

Nicolas Houdré, Diego Marcos, Hugo Riffaud de Turckheim, Dino Ienco, Laurent Wendling, Camille Kurtz, Sylvain Lobry

TL;DR

RAMEN tackles EO heterogeneity by learning a modality-agnostic, resolution-adjustable encoder that unifies optical, radar, and elevation inputs in a shared latent space. It introduces channel-conditioned projections, a GSD-aware spatial resampler with a four-expert mixture, and a temporal attention module, all trained with a self-supervised MAE objective across diverse sensors and resolutions. Pretrained on a large, heterogeneous corpus, RAMEN transfers effectively to unseen sensors and achieves state-of-the-art results on the PANGAEA benchmark, while enabling explicit compute–accuracy trade-offs at inference via a controllable target GSD. This work demonstrates a scalable pathway toward general-purpose EO foundation models capable of adapting to diverse sensor configurations and application needs.

Abstract

Earth observation (EO) data spans a wide range of spatial, spectral, and temporal resolutions, from high-resolution optical imagery to low-resolution multispectral products or radar time series. While recent foundation models have improved multimodal integration for learning meaningful representations, they often expect fixed input resolutions or rely on sensor-specific encoders, which limits generalization across heterogeneous EO modalities. To overcome these limitations, we introduce RAMEN, a resolution-adjustable multimodal encoder that learns a shared visual representation across EO data in a fully sensor-agnostic manner. RAMEN treats the modality and the spatial and temporal resolutions as key input data features, enabling coherent analysis across modalities within a unified latent space. Its main methodological contribution is to define spatial resolution as a controllable output parameter, giving users direct control over the desired level of detail at inference and allowing explicit trade-offs between spatial precision and computational cost. We train a single, unified transformer encoder to reconstruct masked multimodal EO data drawn from diverse sources, ensuring generalization across sensors and resolutions. Once pretrained, RAMEN transfers effectively to both known and unseen sensor configurations and outperforms larger state-of-the-art models on the community-standard PANGAEA benchmark, which comprises various multi-sensor and multi-resolution downstream tasks. Our code and pretrained model are available at https://github.com/nicolashoudre/RAMEN.
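
To make the trade-off between spatial precision and computational cost concrete, the following is a minimal back-of-the-envelope sketch, not code from the RAMEN repository. It assumes, purely for illustration, that the encoder produces one token per output cell at the target GSD and that self-attention cost grows with the square of the token count. The function names `feature_map_side` and `relative_cost` are hypothetical.

```python
import math


def feature_map_side(tile_pixels: int, gsd_native_m: float, gsd_target_m: float) -> int:
    """Side length (in cells) of a square feature map resampled to the target GSD.

    Assumption (not from the paper): one output cell per target-GSD pixel.
    """
    extent_m = tile_pixels * gsd_native_m  # ground extent covered by the tile
    return max(1, math.ceil(extent_m / gsd_target_m))


def relative_cost(tile_pixels: int, gsd_native_m: float, gsd_target_m: float):
    """Return (token count, number of pairwise attention interactions)."""
    side = feature_map_side(tile_pixels, gsd_native_m, gsd_target_m)
    tokens = side * side
    return tokens, tokens * tokens


if __name__ == "__main__":
    # A 128x128 tile acquired at 10 m GSD, encoded at coarser target GSDs.
    for gsd_target in (10.0, 20.0, 40.0, 80.0):
        tokens, pairs = relative_cost(128, 10.0, gsd_target)
        print(f"target GSD {gsd_target:>5.1f} m: {tokens:>6d} tokens, {pairs:>12d} attention pairs")
```

Under these assumptions, doubling the target GSD divides the token count by 4 and the pairwise attention interactions by 16, which is the kind of lever the controllable output resolution exposes at inference.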

Paper Structure

This paper contains 30 sections, 7 equations, 6 figures, 17 tables.

Figures (6)

  • Figure 1: Visual workflow of RAMEN. RAMEN enables consistent adaptation across multimodal imagery via resolution-specific projection modules. Thanks to our adjustable resampling strategy and pretraining scheme, practitioners can define the spatial resolution of the feature maps produced for encoded inputs, which enables fine-grained representations and trade-offs between downstream performance and computational overhead.
  • Figure 2: Architecture of RAMEN. At each iteration, a subset of modalities and a target ground sampling distance (GSD) are sampled. Each selected modality is projected into a shared latent space through three resolution-specific modules: (a) a channel-conditioned projector that embeds the physical meaning of each channel; (b) an adjustable spatial resampler that maps features to the user-defined $GSD_{target}$ with scale-aware convolutional adaptation; (c) a temporal attention module that processes time series enriched with a day-of-acquisition encoding. We adopt a masked image modeling pretraining scheme, in which each modality is reconstructed at its native spectral, spatial, and temporal resolution using inverted versions of these modules.
  • Figure 3: Compute/performance trade-off across four downstream tasks. We plot mIoU versus average GFLOPs per test tile for RAMEN at various target spatial resolutions compared to fixed-resolution foundation models.
  • Figure A: Specialization of the convolutional experts across interpolation ratios. We plot the normalized weights of the four convolutional experts as a function of the interpolation ratio $\text{GSD}_m/\text{GSD}_{target}$.
  • Figure B: Compute/performance trade-off across eight downstream tasks. We plot mIoU versus average GFLOPs per test tile for RAMEN at various target spatial resolutions compared to fixed-resolution foundation models.
  • ...and 1 more figure
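
The Figure A caption describes normalized weights over four convolutional experts as a function of the interpolation ratio $\text{GSD}_m/\text{GSD}_{target}$. The sketch below illustrates one way such a ratio-dependent mixture could be computed. It is an illustrative stand-in, not the authors' implementation: the gate here is a fixed, hand-written softmax over distances in log-ratio space, whereas the actual gating mechanism is not specified in this excerpt. `EXPERT_CENTERS`, `expert_weights`, `mix_experts`, and the `sharpness` parameter are all hypothetical.

```python
import numpy as np

# Hypothetical expert "centres" in log2(GSD_m / GSD_target) space.
# These values are illustrative only and are not taken from the paper.
EXPERT_CENTERS = np.array([-2.0, -0.5, 0.5, 2.0])


def expert_weights(gsd_native: float, gsd_target: float, sharpness: float = 1.0) -> np.ndarray:
    """Normalized weights over four experts given the interpolation ratio.

    Assumption: each expert is favoured near its own centre in log-ratio space.
    """
    log_ratio = np.log2(gsd_native / gsd_target)
    logits = -sharpness * (log_ratio - EXPERT_CENTERS) ** 2
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def mix_experts(expert_outputs: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Blend expert feature maps of shape (4, C, H, W) into one (C, H, W) map."""
    return np.tensordot(weights, expert_outputs, axes=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for the outputs of four convolutional experts on the same input.
    outputs = rng.normal(size=(4, 8, 16, 16))
    w = expert_weights(gsd_native=10.0, gsd_target=40.0)
    mixed = mix_experts(outputs, w)
    print("weights:", np.round(w, 3), "mixed shape:", mixed.shape)
```

With this toy gate, a ratio well below 1 (the target is coarser than the native GSD) puts most of the weight on the first expert, and a ratio well above 1 puts it on the last, mimicking the kind of specialization across interpolation ratios that Figure A plots.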