Unpaired Image-to-Image Translation via a Self-Supervised Semantic Bridge
Jiaming Liu, Felix Petersen, Yunhe Gao, Yabin Zhang, Hyojin Kim, Akshay S. Chaudhari, Yu Sun, Stefano Ermon, Sergios Gatidis
TL;DR
The paper introduces the Self-Supervised Semantic Bridge (SSB), a diffusion-based framework for unpaired image-to-image translation. SSB uses self-supervised encoders to build a shared, geometry-preserving latent space that connects domains without cross-domain supervision or adversarial losses. Domain-specific diffusion bridges then map from this shared latent to per-domain latents, enabling faithful MRI–CT synthesis and natural image editing, including text-guided editing with large-scale priors. A theoretical error bound (Theorem 4.1) quantifies how encoder misalignment, vector-field approximation, discretization, and decoder reconstruction each contribute to translation error, guiding practical design choices. Experiments on medical MRI–CT data and natural image editing show improved structural fidelity and out-of-domain robustness relative to strong prior methods, with linear scaling as domains are added. The approach enables scalable, high-quality unpaired translation across modalities and supports controllable text-guided edits in diffusion-based pipelines.
Abstract
Adversarial diffusion and diffusion-inversion methods have advanced unpaired image-to-image translation, but each faces key limitations. Adversarial approaches require a target-domain adversarial loss during training, which can limit generalization to unseen data, while diffusion-inversion methods often produce low-fidelity translations because inversion into noise-latent representations is imperfect. In this work, we propose the Self-Supervised Semantic Bridge (SSB), a versatile framework that integrates external semantic priors into diffusion bridge models to enable spatially faithful translation without cross-domain supervision. Our key idea is to leverage self-supervised visual encoders to learn representations that are invariant to appearance changes but capture geometric structure, forming a shared latent space that conditions the diffusion bridges. Extensive experiments show that SSB outperforms strong prior methods on challenging medical image synthesis in both in-domain and out-of-domain settings, and extends easily to high-quality text-guided editing.
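
The data flow described above can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`semantic_encoder`, `make_bridge`, `translate`) is hypothetical. A per-image normalization stands in for the frozen self-supervised encoder, and a closed-form straight-line vector field stands in for a learned domain-specific bridge. The sketch only shows how an appearance-invariant shared latent can condition a per-domain bridge that is integrated with Euler steps.

```python
import numpy as np


def semantic_encoder(image):
    """Toy stand-in (assumption) for a frozen self-supervised encoder.

    Removes global intensity offset and contrast ("appearance") while
    keeping the spatial pattern ("geometry").
    """
    return (image - image.mean()) / (image.std() + 1e-8)


def make_bridge(scale, shift):
    """Toy stand-in (assumption) for a learned domain-specific bridge.

    Returns a vector field v(x, t, z) whose straight-line flow starts at the
    shared latent z (t = 0) and ends at scale * z + shift (t = 1).
    A real bridge would be a neural network trained on one domain only.
    """
    def velocity(x, t, z):
        target = scale * z + shift
        return target - z  # constant velocity along the straight path
    return velocity


def translate(image, velocity, steps=10):
    """Encode into the shared latent, then Euler-integrate the target bridge."""
    z = semantic_encoder(image)
    x = z.copy()
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity(x, k * dt, z)
    return x  # a decoder would map this target-domain latent back to pixels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    structure = rng.normal(size=(8, 8))
    source_image = 3.0 * structure + 10.0   # same geometry, one "appearance"
    bridge_to_target = make_bridge(scale=2.0, shift=0.5)
    print(translate(source_image, bridge_to_target).shape)
```

In this toy setting, two inputs that share geometry but differ in brightness and contrast encode to the same shared latent, so they translate to the same output. Adding a domain means adding one more bridge, with no retraining of the encoder or the existing bridges.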
