Any2Any: Unified Arbitrary Modality Translation for Remote Sensing

Haoyang Chen, Jing Zhang, Hebaixu Wang, Shiqin Wang, Pohsun Huang, Jiayuan Li, Haonan Guo, Di Wang, Zheng Wang, Bo Du

TL;DR

This paper introduces RST-1M, the first million-scale remote sensing dataset with paired observations across five sensing modalities, which provides supervision anchors for any-to-any translation. It also shows that Any2Any consistently outperforms pairwise translation methods and exhibits strong zero-shot generalization to unseen modality pairs.

Abstract

Multi-modal remote sensing imagery provides complementary observations of the same geographic scene, yet such observations are frequently incomplete in practice. Existing cross-modal translation methods treat each modality pair as an independent task, resulting in quadratic complexity and limited generalization to unseen modality combinations. We formulate Any-to-Any translation as inference over a shared latent representation of the scene, where different modalities correspond to partial observations of the same underlying semantics. Based on this formulation, we propose Any2Any, a unified latent diffusion framework that projects heterogeneous inputs into a geometrically aligned latent space. This structure enables anchored latent regression with a shared backbone, decoupling modality-specific representation learning from semantic mapping. Moreover, lightweight target-specific residual adapters are used to correct systematic latent mismatches without increasing inference complexity. To support learning under sparse but connected supervision, we introduce RST-1M, the first million-scale remote sensing dataset with paired observations across five sensing modalities, providing supervision anchors for any-to-any translation. Experiments across 14 translation tasks show that Any2Any consistently outperforms pairwise translation methods and exhibits strong zero-shot generalization to unseen modality pairs. Code and models will be available at https://github.com/MiliLab/Any2Any.
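
The complexity argument in the abstract can be made concrete with a small counting sketch. It is illustrative only. The component tally for the unified design (one VAE and one adapter per modality, one shared backbone) is an assumption read from the abstract's description, and `TOY_EDGES` is a hypothetical pairing graph, not the actual RST-1M pair list. The connectivity check shows one plausible reading of "sparse but connected supervision": every modality can be reached from every other through some chain of paired data.

```python
from collections import deque

# The five modalities named for RST-1M.
MODALITIES = ["PAN", "MS", "NIR", "SAR", "RGB"]


def pairwise_model_count(n: int) -> int:
    """Directed source->target pairs among n modalities (one model per pair)."""
    return n * (n - 1)


def unified_component_count(n: int) -> dict:
    """Assumed tally for a shared-backbone design: grows linearly in n."""
    return {"vaes": n, "backbones": 1, "adapters": n}


def is_connected(nodes, paired_edges) -> bool:
    """Breadth-first search over the undirected graph of paired modalities."""
    adjacency = {m: set() for m in nodes}
    for a, b in paired_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(nodes)


# Hypothetical pairing graph for illustration; NOT the RST-1M pair list.
TOY_EDGES = [("RGB", "SAR"), ("SAR", "MS"), ("MS", "NIR"), ("RGB", "PAN")]

if __name__ == "__main__":
    print(pairwise_model_count(len(MODALITIES)))
    print(unified_component_count(len(MODALITIES)))
    print(is_connected(MODALITIES, TOY_EDGES))
```

With five modalities, a pairwise approach needs up to 20 directed models, whereas the assumed unified layout adds only one VAE and one adapter per new modality.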

Paper Structure (30 sections, 13 equations, 7 figures, 4 tables)

This paper contains 30 sections, 13 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: PSNR comparison across 14 modality translation tasks on our proposed RST-1M dataset between three versions of our method (Any2Any-S, Any2Any-B, and Any2Any-L) and representative image-to-image translation approaches (Pix2Pix [isola2017image], Pix2PixHD [wang2018high], BBDM [li2023bbdm], ControlNet [zhang2023adding], and LBM [Chadebec_2025_ICCV]). Our Any2Any-L outperforms the best existing method on every modality pair, with per-modality performance gains highlighted in purple.
  • Figure 2: Statistics and example images of the RST-1M dataset. (a) Modality pair distribution of our RST-1M dataset, derived from five public datasets (SEN12MS, CACo, SEN1-2, Spacenet-5, and Spacenet-3). (b) Sample count for each of the seven modality pairs. (c) Statistics of the five modalities (PAN, MS, NIR, SAR, and RGB), including spatial resolution, representative examples, and image counts.
  • Figure 3: Overview of the Any2Any framework. The framework decouples modality-specific representation learning from shared semantic mapping through three components: (i) independent VAEs $\{E_i, D_i\}$ that establish a dimensionally unified latent manifold $\mathcal{Z}$ across heterogeneous sensors; (ii) a shared Diffusion Transformer (DiT) $f_\theta$ that performs $x_0$-prediction steered by an MLP-based AdaLN mechanism; and (iii) target-indexed Residual Adapters $\{A_j\}$ for localized manifold calibration. During inference, the source latent $\mathbf{z}_i$ is concatenated with noise $\mathbf{z}_t$ and processed by the shared backbone; the predicted $\hat{\mathbf{z}}_j$ is then rectified by the corresponding adapter $A_j$ and reconstructed via $D_j$ in a single-pass feed-forward trajectory, ensuring efficient synthesis with constant computational overhead for any modality pair.
  • Figure 4: Qualitative comparison between our proposed Any2Any (A2A) and other compared methods on the test datasets. Across all modality translation tasks, our Any2Any produces results that are closer to the reference images and better preserve semantic consistency and structural integrity, demonstrating the effectiveness of the proposed method.
  • Figure 5: Qualitative results of our method on unseen remote sensing modality translation tasks with missing paired training data. The results demonstrate that our approach produces reasonable and semantically consistent translations under missing-modality settings, validating its any-to-any translation capability.
  • ...and 2 more figures
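
The single-pass inference path described in the Figure 3 caption can be sketched as below. This is a toy under stated assumptions, not the authors' implementation. Random linear maps stand in for the trained VAEs $\{E_i, D_i\}$, the DiT $f_\theta$, and the adapters $\{A_j\}$. The target identity is passed as a one-hot vector in place of the AdaLN steering, which the caption names but does not detail. `ToyAny2Any`, `LATENT_DIM`, and the channel counts in the usage lines are hypothetical names and values.

```python
import random

LATENT_DIM = 4  # assumed shared latent width; the real value is not given here


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a trained network."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(out_dim)]


def matvec(weights, vec):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


class ToyAny2Any:
    def __init__(self, modality_dims, seed=0):
        self.rng = random.Random(seed)
        self.names = list(modality_dims)
        # One encoder/decoder pair per modality, all mapping to the same latent width.
        self.enc = {m: make_linear(d, LATENT_DIM, self.rng) for m, d in modality_dims.items()}
        self.dec = {m: make_linear(LATENT_DIM, d, self.rng) for m, d in modality_dims.items()}
        # Shared backbone: input is [z_i ; z_t ; one-hot target], output is the predicted z_j.
        self.backbone = make_linear(2 * LATENT_DIM + len(self.names), LATENT_DIM, self.rng)
        # One residual adapter per target modality.
        self.adapter = {m: make_linear(LATENT_DIM, LATENT_DIM, self.rng) for m in self.names}

    def translate(self, x, src, tgt):
        z_i = matvec(self.enc[src], x)                              # encode source
        z_t = [self.rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]  # noise latent
        cond = [1.0 if m == tgt else 0.0 for m in self.names]        # stand-in for AdaLN steering
        z_hat = matvec(self.backbone, z_i + z_t + cond)              # direct prediction of z_j
        residual = matvec(self.adapter[tgt], z_hat)                  # target-specific correction
        z_hat = [a + b for a, b in zip(z_hat, residual)]
        return matvec(self.dec[tgt], z_hat)                          # decode with target decoder


if __name__ == "__main__":
    model = ToyAny2Any({"SAR": 2, "RGB": 3, "NIR": 1})  # hypothetical channel counts
    print(len(model.translate([0.1, 0.2], "SAR", "RGB")))
```

Because the backbone is shared and each call runs one encoder, one backbone pass, one adapter, and one decoder, the cost per translation in this sketch does not depend on which source and target modalities are chosen.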