COP-GEN: Latent Diffusion Transformer for Copernicus Earth Observation Data -- Generation Stochastic by Design

Miguel Espinosa; Eva Gmelich Meijling; Valerio Marsocci; Elliot J. Crowley; Mikolaj Czerkawski

TL;DR

COP-GEN is a multimodal latent diffusion transformer that models the joint distribution of heterogeneous Earth Observation modalities at their native spatial resolutions. It enables flexible any-to-any conditional generation, including zero-shot modality translation, spectral band infilling, and generation under partial or missing inputs, without task-specific retraining.

Abstract

Earth observation applications increasingly rely on data from multiple sensors, including optical, radar, elevation, and land-cover products. Relationships between these modalities are fundamental for data integration but are inherently non-injective: identical conditioning information can correspond to multiple physically plausible observations. Deterministic models therefore tend to collapse toward conditional means and fail to represent the uncertainty and variability required for tasks such as data completion and cross-sensor translation; such conditional mappings should instead be parameterised as data distributions. We introduce COP-GEN, a multimodal latent diffusion transformer that models the joint distribution of heterogeneous Earth Observation modalities at their native spatial resolutions. By parameterising cross-modal mappings as conditional distributions, COP-GEN enables flexible any-to-any conditional generation, including zero-shot modality translation, spectral band infilling, and generation under partial or missing inputs, without task-specific retraining. Experiments on a large-scale global multimodal dataset show that COP-GEN generates diverse yet physically consistent realisations while maintaining strong peak fidelity across optical, radar, and elevation modalities. Qualitative and quantitative analyses demonstrate that the model captures meaningful cross-modal structure and systematically adapts its output uncertainty as conditioning information increases. These results highlight the practical importance of stochastic generative modelling for Earth observation and motivate evaluation protocols that move beyond single-reference, pointwise metrics. Website: https://miquel-espinosa.github.io/cop-gen
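
The collapse toward conditional means mentioned in the abstract follows from a standard property of squared-error regression. The sketch below uses assumed notation that is not taken from the paper ($c$ for the conditioning input, $y$ for the target observation, $f$ for a deterministic predictor, and $y_1$, $y_2$ for two equally likely plausible outcomes), and it applies only to predictors trained with a squared-error loss.

```latex
% Assumed notation: c = conditioning input, y = target observation,
% f = deterministic predictor trained with a squared-error loss.
\begin{align}
f^\star &= \arg\min_f \; \mathbb{E}_{(c,y)}\big[\lVert y - f(c)\rVert^2\big]
\quad\Longrightarrow\quad
f^\star(c) = \mathbb{E}[\,y \mid c\,], \\
% Illustrative two-mode case: given c, y equals y_1 or y_2 with equal probability.
p(y \mid c) &= \tfrac12\,\delta(y - y_1) + \tfrac12\,\delta(y - y_2)
\quad\Longrightarrow\quad
f^\star(c) = \tfrac12\,(y_1 + y_2).
\end{align}
```

When $y_1 \neq y_2$, the optimal deterministic prediction is the average of the two modes and matches neither plausible observation, whereas a sample drawn from $p(y \mid c)$ returns one of them.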

Paper Structure

This paper contains 35 sections, 2 equations, 27 figures, 9 tables.

Introduction
Related Work
Generative Models and Diffusion Transformers
Generative Models for Earth Observation
Methodology
Dataset and Modalities
Latent Representation Learning
VAE Training
Encoding of Geolocation and Time
Unified Multimodal Diffusion Model
Tokenization and Input Representation
Transformer Diffusion Backbone
Conditional and Unconditional Sampling
Joint Unconditional Generation
Any-to-Any Conditional Generation
...and 20 more sections

Figures (27)

Figure 1: Conditional generation of Sentinel-2 L2A imagery from topography (DEM) and land-cover (LULC) inputs. We condition COP-GEN generations on DEM and LULC inputs (geographic location is provided solely for visual reference). COP-GEN produces diverse and physically consistent outputs, demonstrating variability in spectral appearance, illumination, and atmospheric conditions while preserving topographic and land-cover constraints. This highlights the model’s ability to capture the inherent one-to-many relationships of multimodal Earth Observation data. LULC classes are visualized using the following color scheme: Water, Trees, Flooded vegetation, Crops, Built-up areas, Bare ground, Snow/ice, Clouds, and Rangeland. Additional qualitative results and visualisations are provided in the Supplementary Material.
Figure 2: COP-GEN architecture, training, and inference overview. Multimodal inputs (optical, radar, elevation, land-cover, geolocation, and timestamps) are encoded into latent representations using modality-specific VAEs (or directly tokenized for scalar inputs). All tokens, augmented with modality-specific diffusion timestep embeddings, are processed by a shared transformer diffusion backbone. The model is trained to jointly predict noise for all modalities. At inference, modalities can be either sampled from noise or fixed at timestep zero, enabling both unconditional generation and flexible any-to-any conditional translation across modalities.
Figure 3: Geospatial Distribution Analysis. We predict latitude--longitude coordinates conditioned on DEM and LULC inputs ($n=50$ runs). TerraMind (blue) collapses to a few locations, whereas COP-GEN (green) predicts a distribution of plausible locations that share similar topographic and biome characteristics, accurately reflecting the non-injective nature of the mapping. A Köppen--Geiger climate classification basemap is overlaid to provide climatic context for the predicted locations. The ground-truth acquisition location is indicated by a red star ($\textcolor{red}{\bigstar}$), and real thumbnail visualisations of the predicted locations are shown for comparison.
Figure 4: Distribution spread narrowing by progressively increasing input conditioning. As additional modalities are provided as input, the generated samples better align with the ground-truth (GT) distribution. For each conditioning setup, we generate 25 stochastic samples of Sentinel-2 L2A (S2L2A) imagery and report the predicted band distributions using histograms and kernel density estimates (KDEs). One spectral band is selected per S2L2A spatial resolution. The legend indicates the set of input modalities used for conditioning, always for a fixed geographic tile (215U_1019R). Additional bands and visualisations are provided in the appendix.
Figure 5: Per-pixel spectral reflectance profiles across multiple LULC classes. Conditioned on DEM and LULC inputs, COP-GEN generates multispectral S2L2A imagery that captures physically consistent spectral relationships. The plots compare spectral profiles from selected pixel locations in both real and generated images across the Sentinel-2 bands. The close alignment for trees, bare soil, water, crops, built-up areas, etc. demonstrates the model's ability to accurately reconstruct characteristic land-cover responses. Geographical location is provided for reference. Additional visualisations are provided in Supplementary Material.
...and 22 more figures
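
The inference mechanism described in the Figure 2 caption (each modality is either sampled from noise or fixed at timestep zero) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `init_sampling_state`, the lowercase modality keys, the token shapes, and the default of 1000 diffusion steps are all hypothetical choices made for illustration.

```python
import numpy as np


def init_sampling_state(token_shapes, conditions, num_steps=1000, seed=0):
    """Build the initial tokens and per-modality timesteps for sampling.

    token_shapes: dict mapping modality name -> (num_tokens, dim) of its latents.
    conditions:   dict mapping modality name -> clean latent tokens to hold fixed.
    num_steps:    hypothetical number of diffusion steps (an assumption).
    """
    rng = np.random.default_rng(seed)
    tokens, timesteps = {}, {}
    for name, shape in token_shapes.items():
        if name in conditions:
            # Conditioning modality: keep the clean latents and mark them
            # with timestep 0 so the backbone treats them as noise-free.
            z = np.asarray(conditions[name], dtype=np.float32)
            if z.shape != tuple(shape):
                raise ValueError(f"{name}: expected {shape}, got {z.shape}")
            tokens[name] = z
            timesteps[name] = 0
        else:
            # Generated modality: start from Gaussian noise at the last timestep.
            tokens[name] = rng.standard_normal(shape).astype(np.float32)
            timesteps[name] = num_steps
    return tokens, timesteps


# Hypothetical latent shapes; DEM and LULC are given, the rest are generated.
shapes = {"s2l2a": (64, 8), "radar": (64, 8), "dem": (16, 8), "lulc": (64, 8)}
given = {"dem": np.zeros((16, 8)), "lulc": np.ones((64, 8))}
tokens, timesteps = init_sampling_state(shapes, given)
print(timesteps)
```

Passing an empty `conditions` dict corresponds to joint unconditional generation, and any other subset of modalities gives one of the any-to-any conditional settings. A full sampler would then repeatedly call the denoising backbone and lower only the timesteps of the generated modalities, which is omitted here.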
