EO-VAE: Towards A Multi-sensor Tokenizer for Earth Observation Data

Nils Lehmann, Yi Wang, Zhitong Xiong, Xiaoxiang Zhu

TL;DR

EO-VAE tackles multi-sensor Earth Observation tokenization with a single modality-agnostic variational autoencoder whose first and last layers are dynamic hypernetworks conditioned on the channel wavelengths $\lambda$. It is trained in two stages: weight distillation with $\mathcal{L} = \|W_T - W_S\|$, followed by end-to-end reconstruction training with $\mathcal{L}_{\mathrm{rec}}$, yielding reconstructed images $\hat{x} = D_{\theta_D}(E_{\theta_E}(x;\lambda);\lambda)$. On TerraMesh, EO-VAE outperforms TerraMind on both S2L2A and S1RTC, with substantial improvements in PSNR and NDVI MAE that indicate higher fidelity and physical consistency. Used as a frozen latent tokenizer, it enables a Latent Diffusion Model for super-resolution that matches RGB Flux.2 quality while delivering about $18\times$ efficiency gains over pixel-space diffusion. The work provides a scalable baseline for latent EO modeling and points toward future expansion to more sensors, channel configurations, and spatio-temporal architectures.
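To make the wavelength-conditioned first layer and the distillation objective more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that $W_T$ denotes the frozen teacher (Flux.2) convolution weights and $W_S$ the weights generated by the hypernetwork. The sinusoidal wavelength embedding, the single linear projection, the layer sizes, and all function and class names are hypothetical choices made for illustration. The Sentinel-2 wavelengths are approximate nominal band centers.

```python
import numpy as np

rng = np.random.default_rng(0)


def wavelength_embedding(wavelengths_um, dim=16):
    """Sinusoidal embedding of each channel's central wavelength.

    Hypothetical choice for illustration; returns an array of shape (C, dim).
    """
    lam = np.asarray(wavelengths_um, dtype=np.float64)[:, None]         # (C, 1)
    freqs = np.exp(np.linspace(0.0, np.log(100.0), dim // 2))[None, :]  # (1, dim/2)
    angles = lam * freqs                                                # (C, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)     # (C, dim)


class DynamicConvWeights:
    """Toy hypernetwork: maps wavelength embeddings to convolution kernels.

    One kernel slice is generated per input channel, so the same module
    can serve inputs with any number of spectral channels.
    """

    def __init__(self, embed_dim=16, out_channels=8, kernel_size=3):
        self.embed_dim = embed_dim
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        n_out = out_channels * kernel_size * kernel_size
        self.proj = rng.normal(0.0, 0.02, size=(embed_dim, n_out))

    def __call__(self, wavelengths_um):
        emb = wavelength_embedding(wavelengths_um, self.embed_dim)  # (C, dim)
        flat = emb @ self.proj                                      # (C, out*k*k)
        k = self.kernel_size
        w = flat.reshape(flat.shape[0], self.out_channels, k, k)    # (C, out, k, k)
        return w.transpose(1, 0, 2, 3)                              # (out, C, k, k)


def distillation_loss(w_teacher, w_student):
    """Frobenius norm ||W_T - W_S|| between teacher and generated weights."""
    return float(np.linalg.norm(w_teacher - w_student))


hypernet = DynamicConvWeights()

# Approximate central wavelengths in micrometres (red, green, blue).
rgb_wavelengths = [0.665, 0.560, 0.490]
# Approximate nominal centers of 12 Sentinel-2 L2A bands, in micrometres.
s2_wavelengths = [0.443, 0.490, 0.560, 0.665, 0.705, 0.740,
                  0.783, 0.842, 0.865, 0.945, 1.610, 2.190]

w_rgb = hypernet(rgb_wavelengths)  # kernel for a 3-channel input
w_s2 = hypernet(s2_wavelengths)    # kernel for a 12-channel input

# Stand-in for frozen pretrained RGB weights; random here, for illustration only.
w_teacher = rng.normal(0.0, 0.02, size=w_rgb.shape)
loss = distillation_loss(w_teacher, w_rgb)
print(w_rgb.shape, w_s2.shape, loss)
```

In this sketch the number of input channels is set by the length of the wavelength list rather than by a fixed layer shape, which is the property that lets one model handle several sensors. In a real training loop, the distillation stage would minimize `loss` with respect to the hypernetwork parameters before any end-to-end fine-tuning.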

Abstract

State-of-the-art generative image and video models rely heavily on tokenizers that compress high-dimensional inputs into more efficient latent representations. While this paradigm has revolutionized RGB generation, Earth observation (EO) data presents unique challenges due to diverse sensor specifications and variable spectral channels. We propose EO-VAE, a multi-sensor variational autoencoder designed to serve as a foundational tokenizer for the EO domain. Unlike prior approaches that train separate tokenizers for each modality, EO-VAE utilizes a single model to encode and reconstruct flexible channel combinations via dynamic hypernetworks. Our experiments on the TerraMesh dataset demonstrate that EO-VAE achieves superior reconstruction fidelity compared to the TerraMind tokenizers, establishing a robust baseline for latent generative modeling in remote sensing.
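Reconstruction fidelity of this kind is typically quantified with PSNR, and physical consistency with the error of a derived index such as NDVI. The sketch below shows one way these two metrics could be computed for a channels-first array. It is an illustration under stated assumptions, not the paper's evaluation code: the function names, the `data_range` convention, the `eps` stabilizer, and the band indices for red and near-infrared are all hypothetical and depend on how the data is ordered and normalized.

```python
import numpy as np


def psnr(x, x_hat, data_range=1.0):
    """Peak signal-to-noise ratio in dB for arrays scaled to [0, data_range]."""
    mse = np.mean((x - x_hat) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))


def ndvi(red, nir, eps=1e-8):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + eps)


def ndvi_mae(x, x_hat, red_idx, nir_idx):
    """Mean absolute error between NDVI of the input and of its reconstruction.

    x and x_hat are (C, H, W) arrays; red_idx and nir_idx select the bands.
    """
    ref = ndvi(x[red_idx], x[nir_idx])
    rec = ndvi(x_hat[red_idx], x_hat[nir_idx])
    return float(np.mean(np.abs(ref - rec)))


rng = np.random.default_rng(0)
x = rng.uniform(0.05, 1.0, size=(12, 64, 64))                   # toy 12-band image
x_hat = np.clip(x + rng.normal(0.0, 0.01, size=x.shape), 0, 1)  # toy reconstruction

# Assumed band order: index 3 is red and index 7 is near-infrared.
print(psnr(x, x_hat), ndvi_mae(x, x_hat, red_idx=3, nir_idx=7))
```

A low NDVI MAE indicates that the ratio between the red and near-infrared bands is preserved, which a per-pixel metric like PSNR does not directly measure.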

Paper Structure

This paper contains 12 sections, 2 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: EO-VAE Architecture and Training Regime. The first and last convolutional layers of the Flux.2 Autoencoder architecture are replaced with dynamic convolution hypernetworks [xiong2024neural]. After weight distillation from the frozen Flux.2 convolutional weights, we fine-tune end-to-end on the multimodal TerraMesh dataset.
  • Figure 2: Qualitative samples of reconstructed modalities. EO-VAE reconstructs details in both modalities better than the TerraMind tokenizers.
  • Figure 3: Qualitative comparison between EO-VAE and Flux-VAE on reconstructed super-resolution predictions.
  • Figure 4: Channel-wise histogram of raw, unnormalized data for the S2L2A modality, showing a value range exceeding 10000.
  • Figure 5: Minimum sample values plotted across time. The processing baseline change on January 22, 2022 becomes clearly visible.
  • ...and 1 more figure