Sat-JEPA-Diff: Bridging Self-Supervised Learning and Generative Diffusion for Remote Sensing

Kursat Komurcu, Linas Petkevicius

Abstract

Predicting satellite imagery requires balancing structural accuracy against textural detail. Standard deterministic methods such as PredRNN or SimVP minimize pixel-wise error but suffer from the "regression to the mean" problem, producing blurry outputs that obscure subtle geospatial features. Generative models produce realistic textures but often hallucinate structural anomalies. To bridge this gap, we introduce Sat-JEPA-Diff, which combines Self-Supervised Learning (SSL) with Latent Diffusion Models (LDM). An IJEPA module predicts stable semantic representations, which then guide a frozen Stable Diffusion backbone through a lightweight cross-attention adapter. This grounds the synthesized high-fidelity textures in accurate structural predictions. Evaluated on a global Sentinel-2 dataset, Sat-JEPA-Diff excels at resolving sharp boundaries. It achieves leading perceptual scores (GSSIM: 0.8984, FID: 0.1475) and significantly outperforms deterministic baselines, despite the stability limits typical of autoregressive prediction. The code and dataset are publicly available at https://github.com/VU-AIML/SAT-JEPA-DIFF.
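
To make the conditioning idea in the abstract concrete, the following is a minimal sketch of how backbone tokens could read from predicted embeddings through cross-attention. It is not the authors' adapter: the identity Q/K/V projections, the `gate` parameter, and all function and variable names are assumptions chosen for illustration, and the toy vectors stand in for real features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention.

    queries: backbone tokens, shape (n_q, d)
    context: conditioning tokens (stand-ins for predicted embeddings), shape (n_c, d)
    Assumption: identity projections for Q, K and V; a real adapter would learn them.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)]
        )
    return attended


def adapt(backbone_tokens, predicted_embeddings, gate=0.5):
    """Residual update: backbone tokens plus a gated attention readout.

    The backbone tokens themselves are never modified in place, mirroring
    the idea of a frozen backbone. `gate` is a hypothetical scalar.
    """
    readout = cross_attention(backbone_tokens, predicted_embeddings)
    return [
        [t + gate * r for t, r in zip(token, out)]
        for token, out in zip(backbone_tokens, readout)
    ]


# Toy data: two backbone tokens, three conditioning tokens, dimension 2.
backbone_tokens = [[1.0, 0.0], [0.0, 1.0]]
predicted_embeddings = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
conditioned = adapt(backbone_tokens, predicted_embeddings)
```

With `gate=0.0` the output equals the backbone tokens, so in this sketch the conditioning signal is purely additive and can be switched off without touching the backbone.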

Paper Structure (19 sections, 8 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Overview of Sat-JEPA-Diff. The IJEPA module (left) predicts future semantic embeddings $\hat{z}_{t+1}$ from input $I_t$. These embeddings, combined with coarse spatial structure, condition a frozen SD3.5 backbone via a learned adapter to generate $\hat{I}_{t+1}$.
  • Figure 2: Qualitative comparison of next-frame predictions ($t \to t+1$). While deterministic baselines (PredRNN, SimVP) suffer from spectral blurring, Sat-JEPA-Diff preserves high-frequency details and geospatial boundaries.
  • Figure 3: Geographical distribution of the 100 selected Regions of Interest (RoIs).
  • Figure 4: Systematic IJEPA Loss Ablation. Validation metrics over 100 epochs show that only the full objective function (curve E) avoids representation collapse, maintaining high embedding variance where the reduced variants (curves A-D) fall to near-zero variance. Despite a higher total loss, the full model maintains high cosine similarity and achieves superior spatial variance.
  • Figure 5: Long-horizon autoregressive rollout comparison ($2018 \to 2024$) on the Rio de Janeiro Coast. Top Row: Ground Truth. Rows 2-3: Deterministic baselines rapidly degrade into spectral blurring (spatial collapse) after 2-3 steps. Bottom Row: Sat-JEPA-Diff maintains high contrast and structural sharpness throughout the 7-year horizon.
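
The Figure 4 caption refers to embedding variance and cosine similarity as signals of representation collapse. The sketch below shows one plausible way to compute such diagnostics. The exact definitions used in the paper are not given here, so the per-dimension population variance, the function names, and the toy batches are all assumptions.

```python
import math


def per_dimension_variance(embeddings):
    """Population variance of each embedding dimension across a batch."""
    n = len(embeddings)
    dim = len(embeddings[0])
    means = [sum(e[j] for e in embeddings) / n for j in range(dim)]
    return [
        sum((e[j] - means[j]) ** 2 for e in embeddings) / n for j in range(dim)
    ]


def mean_variance(embeddings):
    """Average the per-dimension variances into a single collapse indicator."""
    variances = per_dimension_variance(embeddings)
    return sum(variances) / len(variances)


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical batches: every sample identical (collapsed) vs. spread out.
collapsed_batch = [[0.5, 0.5, 0.5] for _ in range(4)]
healthy_batch = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

collapsed_score = mean_variance(collapsed_batch)
healthy_score = mean_variance(healthy_batch)
```

A collapsed encoder maps every input to nearly the same vector, so its variance indicator sits near zero even if the cosine similarity between predictions and targets looks high. This is why the two metrics are usually read together.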