COP-GEN-Beta: Unified Generative Modelling of COPernicus Imagery Thumbnails

Miguel Espinosa, Valerio Marsocci, Yuru Jia, Elliot J. Crowley, Mikolaj Czerkawski

TL;DR

COP-GEN-Beta tackles the problem of learning a unified generative prior across multiple Copernicus EO modalities. It introduces a transformer-based diffusion model that processes four modalities (DEM, S1 RTC, S2L1C, S2L2A) as a shared latent sequence with modality-specific timesteps, enabling zero-shot translation between any subset of modalities. The approach delivers both quantitative gains over a diffusion-based baseline and rich qualitative capabilities, such as atmospheric correction and elevation estimation, while supporting flexible sampling modes and easy extension to new data sources. This work lays a foundation for powerful, generalist pre-trained models in Earth observation with practical impact on sensor fusion and data augmentation across diverse applications.

Abstract

In remote sensing, multi-modal data from various sensors capturing the same scene offers rich opportunities, but learning a unified representation across these modalities remains a significant challenge. Traditional methods have often been limited to single or dual-modality approaches. In this paper, we introduce COP-GEN-Beta, a generative diffusion model trained on optical, radar, and elevation data from the Major TOM dataset. What sets COP-GEN-Beta apart is its ability to map any subset of modalities to any other, enabling zero-shot modality translation after training. This is achieved through a sequence-based diffusion transformer, where each modality is controlled by its own timestep embedding. We extensively evaluate COP-GEN-Beta on thumbnail images from the Major TOM dataset, demonstrating its effectiveness in generating high-quality samples. Qualitative and quantitative evaluations validate the model's performance, highlighting its potential as a powerful pre-trained model for future remote sensing tasks.
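The idea of giving each modality its own timestep can be illustrated with a small forward-noising sketch. This is not the authors' implementation: the linear beta schedule, the 1000 steps, the 4x8x8 latent shape, the independent per-modality timestep draw, and all function names (`make_alpha_bar`, `noise_modalities`, `to_sequence`) are assumptions made for illustration. Random arrays stand in for the autoencoder latents.

```python
import numpy as np

MODALITIES = ("DEM", "S1RTC", "S2L1C", "S2L2A")
NUM_STEPS = 1000  # assumed number of diffusion steps


def make_alpha_bar(num_steps=NUM_STEPS, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal level per timestep; index 0 means 'no noise'."""
    betas = np.linspace(beta_start, beta_end, num_steps)  # assumed linear schedule
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])


def noise_modalities(latents, timesteps, alpha_bar, rng):
    """Noise each modality's latent at that modality's own timestep."""
    noisy = {}
    for name in MODALITIES:
        z0 = latents[name]
        a = alpha_bar[timesteps[name]]
        eps = rng.standard_normal(z0.shape)
        # Standard DDPM forward step: z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps
        noisy[name] = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps
    return noisy


def to_sequence(noisy):
    """Flatten each (C, H, W) latent into H*W tokens of width C and concatenate."""
    tokens = [noisy[m].reshape(noisy[m].shape[0], -1).T for m in MODALITIES]
    return np.concatenate(tokens, axis=0)


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Stand-in latents; a real pipeline would encode images with the autoencoder.
latents = {m: rng.standard_normal((4, 8, 8)) for m in MODALITIES}

# One independent timestep per modality (an assumption about training).
timesteps = {m: int(rng.integers(0, NUM_STEPS + 1)) for m in MODALITIES}
timesteps["S1RTC"] = 0  # treat the radar latent as observed, i.e. left clean

noisy = noise_modalities(latents, timesteps, alpha_bar, rng)
sequence = to_sequence(noisy)  # shape: (4 modalities * 64 tokens, 4 channels)
```

With a timestep of 0 the S1RTC latent passes through unchanged, which is one way a single shared sequence can mix clean (conditioning) and noisy (to-be-generated) modalities.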

Paper Structure

This paper contains 27 sections, 13 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: By training on dense, global coverage, COP-GEN-Beta has captured a wide and diverse data distribution of the supported modalities. Emergent effects such as seasonality can be observed when sampling multiple images conditioned on the same S1RTC sample, despite the model having been trained on only one temporal sample for each location in the world (since Major TOM does not provide multi-temporal data). COP-GEN-Beta is capable of synthesising new locations that do not exist, but it can also reimagine existing locations in conditions that were never observed. Best viewed when zoomed in.
  • Figure 2: COP-GEN-Beta is the first generative model trained on the joint distribution of Sentinel-2 (both L1C and L2A), Sentinel-1 RTC, and Copernicus GLO-30 DEM data. This is done through (a) sampling a global and dense dataset of these modalities from Major TOM, encoding all images with a pretrained StableDiffusion autoencoder, and (b) training a sequence-based denoising diffusion model using a transformer backbone, where each modality is supplied with its designated timestep. This approach makes it possible to (c) generate all modalities based on any subset thereof that is available.
  • Figure 3: COP-GEN-Beta supports translation between processing levels of Sentinel-2. In one direction it emulates the official processor for the L2A level (approximating Bottom-of-Atmosphere based on Top-of-Atmosphere input). In the opposite direction it approximates a possible L1C product from an L2A observation, which is equivalent to synthesising the atmospheric effect. Best viewed when zoomed in.
  • Figure 4: COP-GEN-Beta can map any modality to an elevation model estimate, which can be highly useful for dynamically changing terrains. Here, it is shown how any modality (Sentinel-1 or Sentinel-2) can be used to approximate the elevation, even a radiometrically terrain-corrected radar product. Best viewed when zoomed in.
  • Figure 5: Looping. We illustrate the iterative process of conditioning the model on its own generated outputs. Starting from a real Sentinel-2 L2A image (left), the model first generates a corresponding Sentinel-1 RTC image (middle), which is then used to synthesize a new Sentinel-2 L2A image (right). Best viewed when zoomed in. Longer loop sequences can be found in the Supplementary Material.
  • ...and 4 more figures
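
The any-subset conditioning of Figure 2(c) and the looping of Figure 5 can be sketched as bookkeeping over per-modality timesteps. The sketch below assumes that conditioning modalities are held at timestep 0 while target modalities are stepped down from the maximum timestep. The helper names (`conditioning_splits`, `timestep_schedule`, `loop_plan`), the 1000 steps, and the four sampling steps are hypothetical and not taken from the paper. No denoising network is included.

```python
from itertools import combinations

MODALITIES = ("DEM", "S1RTC", "S2L1C", "S2L2A")


def conditioning_splits(modalities=MODALITIES):
    """All (conditions, targets) pairs that leave at least one target modality."""
    splits = []
    for k in range(len(modalities)):  # k = number of conditioning modalities
        for cond in combinations(modalities, k):
            targets = tuple(m for m in modalities if m not in cond)
            splits.append((cond, targets))
    return splits


def timestep_schedule(conditions, num_steps=1000, num_sampling_steps=4):
    """Per-step timesteps: conditions stay at 0, targets count down from num_steps."""
    unknown = set(conditions) - set(MODALITIES)
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")
    schedule = []
    for i in range(num_sampling_steps):
        t = num_steps * (num_sampling_steps - i) // num_sampling_steps
        schedule.append({m: 0 if m in conditions else t for m in MODALITIES})
    return schedule


def loop_plan(start, via, rounds=1):
    """(condition, target) pairs for looping: start -> via -> start, repeated."""
    plan = []
    for _ in range(rounds):
        plan.append((start, via))
        plan.append((via, start))
    return plan


splits = conditioning_splits()           # includes the unconditional case ()
schedule = timestep_schedule(["S1RTC"])  # generate everything from radar
plan = loop_plan("S2L2A", "S1RTC")       # the loop described for Figure 5
```

Under these assumptions, four modalities give 15 conditioning configurations (including fully unconditional generation), all served by one model because only the timestep assignment changes.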