EarthBridge: A Solution for 4th Multi-modal Aerial View Image Challenge Translation Track

Zhenyuan Chen, Guanyuan Shen, Feng Zhang

TL;DR

This paper presents EarthBridge, a high-fidelity translation framework developed for the 4th Multi-modal Aerial View Image Challenge -- Translation (MAVIC-T), which achieved a composite score of 0.38, securing the second position on the MAVIC-T leaderboard.

Abstract

Cross-modal image-to-image translation among Electro-Optical (EO), Infrared (IR), and Synthetic Aperture Radar (SAR) sensors is essential for comprehensive multi-modal aerial-view analysis. However, translating between these modalities is notoriously difficult due to their distinct electromagnetic signatures and geometric characteristics. This paper presents EarthBridge, a high-fidelity translation framework developed for the 4th Multi-modal Aerial View Image Challenge -- Translation (MAVIC-T). We explore two distinct methodologies: Diffusion Bridge Implicit Models (DBIM), which we generalize using non-Markovian bridge processes for high-quality deterministic sampling, and Contrastive Unpaired Translation (CUT), which uses contrastive learning for structural consistency. Our EarthBridge framework employs a channel-concatenated UNet denoiser trained with Karras-weighted bridge scalings and a specialized "booting noise" initialization to handle the inherent ambiguity in cross-modal mappings. We evaluate these methods across all four challenge tasks (SAR$\rightarrow$EO, SAR$\rightarrow$RGB, SAR$\rightarrow$IR, RGB$\rightarrow$IR), achieving superior spatial detail and spectral accuracy. Our solution achieved a composite score of 0.38, securing the second position on the MAVIC-T leaderboard. Code is available at https://github.com/Bili-Sakura/EarthBridge-Preview.

Paper Structure (38 sections, 7 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Distribution evolution of the diffusion bridge. The source distribution at $\bm{z}_T=\bm{x}$ (left) converges through time $t$ to the target distribution at $\bm{z}_0=\bm{y}$ (right), governed by the DBIM update rule and denoiser $D_\theta$.
  • Figure 2: Deterministic DBIM sampling. The noisy input $\bm{z}_T=\bm{x}+\bm{\epsilon}_{\text{boot}}$ is progressively denoised by $D_\theta(\bm{z}_t, t, \bm{x})$ to produce the clean output $\bm{z}_0=\bm{y}$. The step sequence shows conditioning, booting noise injection, and gradual refinement from $t=T$ to $t=0$.
  • Figure 3: UNet denoiser architecture for 1024px tasks (SAR$\rightarrow$RGB, RGB$\rightarrow$IR). The encoder downsamples via ResBlocks and average pooling; the decoder upsamples with skip connections. Source $\bm{x}$ is channel-concatenated with $\bm{z}_t$; the output is $\hat{\bm{z}}_0$. We train at 512$\times$512 and infer at full resolution.
  • Figure 4: Qualitative results across all four MAVIC-T translation tasks. Each row corresponds to one task (SAR$\rightarrow$EO, SAR$\rightarrow$RGB, SAR$\rightarrow$IR, RGB$\rightarrow$IR). Columns show: source image, EarthBridge output, and ground-truth target. Our model preserves structural layout from the source while synthesizing faithful target-modality textures across both 256$\times$256 and 1024$\times$1024 resolutions.
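
The sampling flow described in the captions of Figures 2 and 3 (booting-noise initialization $\bm{z}_T=\bm{x}+\bm{\epsilon}_{\text{boot}}$, channel concatenation of $\bm{x}$ with $\bm{z}_t$, and stepwise refinement from $t=T$ to $t=0$) can be sketched roughly as below. This is a minimal NumPy illustration, not the authors' implementation. The function names, the Gaussian form and scale `sigma_boot` of the booting noise, the timestep grid, and the assumption that source and target share the same shape are all hypothetical. The per-step blend is a simple placeholder and is not the DBIM update rule, which is not reproduced in this section.

```python
import numpy as np

def add_booting_noise(x, sigma_boot, rng):
    """Form z_T = x + eps_boot.

    Assumption: eps_boot is Gaussian with a hypothetical scale sigma_boot.
    """
    return x + sigma_boot * rng.standard_normal(x.shape)

def denoiser_input(z_t, x):
    """Channel-concatenate the current state z_t with the source x.

    Assumption: arrays use a (channels, height, width) layout.
    """
    return np.concatenate([z_t, x], axis=0)

def sample(denoiser, x, timesteps, sigma_boot=0.1, seed=0):
    """Refine z_t from t = T down to t = 0 with no noise added after initialization.

    NOTE: the linear blend below is a placeholder for illustration only;
    it is NOT the DBIM update rule.
    """
    rng = np.random.default_rng(seed)
    z = add_booting_noise(x, sigma_boot, rng)
    for t, s in zip(timesteps[:-1], timesteps[1:]):
        # The denoiser sees both the current state and the source image.
        z0_hat = denoiser(denoiser_input(z, x), t)
        w = s / t  # fraction of the current state kept; reaches 0 at s = 0
        z = w * z + (1.0 - w) * z0_hat
    return z

# Toy check with a stand-in "denoiser" that always predicts a fixed target y.
x = np.zeros((3, 8, 8))
y = np.ones((3, 8, 8))
toy_denoiser = lambda inp, t: y
z0 = sample(toy_denoiser, x, timesteps=np.linspace(1.0, 0.0, 6))
```

In a real setup, `toy_denoiser` would be replaced by the trained UNet $D_\theta$, and the placeholder blend by the bridge-specific coefficients; the sketch is only meant to show where conditioning and booting noise enter the loop.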