Unpaired Image-to-Image Translation via a Self-Supervised Semantic Bridge

Jiaming Liu; Felix Petersen; Yunhe Gao; Yabin Zhang; Hyojin Kim; Akshay S. Chaudhari; Yu Sun; Stefano Ermon; Sergios Gatidis

TL;DR

The paper introduces the Self-Supervised Semantic Bridge (SSB), a diffusion-based framework for unpaired image-to-image translation that builds a shared, geometry-preserving latent space from self-supervised encoders to connect domains without cross-domain supervision or adversarial losses. Domain-specific diffusion bridges map from this shared latent to per-domain latents, enabling faithful MRI–CT synthesis and natural image editing, including text-guided editing with large-scale priors. A theoretical error bound (Theorem 4.1) quantifies how encoder misalignment, vector-field approximation, discretization, and decoder reconstruction each contribute to translation error, which guides practical design choices. Empirical results on medical MRI–CT data and natural image editing show improved structural fidelity and out-of-domain robustness relative to strong prior methods, and the framework scales linearly as domains are added. The approach supports high-quality unpaired translation across modalities and controllable text-guided edits in diffusion-based pipelines.

Abstract

Adversarial diffusion and diffusion-inversion methods have advanced unpaired image-to-image translation, but each faces key limitations. Adversarial approaches require a target-domain adversarial loss during training, which can limit generalization to unseen data, while diffusion-inversion methods often produce low-fidelity translations due to imperfect inversion into noise-latent representations. In this work, we propose the Self-Supervised Semantic Bridge (SSB), a versatile framework that integrates external semantic priors into diffusion bridge models to enable spatially faithful translation without cross-domain supervision. Our key idea is to leverage self-supervised visual encoders to learn representations that are invariant to appearance changes but capture geometric structure, forming a shared latent space that conditions the diffusion bridges. Extensive experiments show that SSB outperforms strong prior methods for challenging medical image synthesis in both in-domain and out-of-domain settings, and extends easily to high-quality text-guided editing.
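
The flow described in the abstract (encode to a shared, appearance-invariant latent, then run a domain-specific bridge) can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration, not the paper's implementation: `encode`, `make_bridge_field`, `euler_bridge`, `decode`, and `translate` are hypothetical names, "appearance" is reduced to a global brightness offset, and the learned vector field is replaced by a constant velocity.

```python
def encode(image):
    """Hypothetical stand-in for the encoder: subtract the mean so a global
    brightness offset (the toy 'appearance') is removed."""
    mean = sum(image) / len(image)
    return [p - mean for p in image]


def make_bridge_field(domain_offset):
    """Toy stand-in for one domain's learned vector field: a constant velocity
    that carries the shared latent y to y + domain_offset over t in [0, 1]."""
    def v(z, t, y):
        return [domain_offset] * len(z)
    return v


def euler_bridge(y, v, num_steps=8):
    """Integrate dz/dt = v(z, t, y) from t = 0 (z = y) to t = 1 with forward Euler."""
    z = list(y)
    h = 1.0 / num_steps
    for k in range(num_steps):
        vel = v(z, k * h, y)
        z = [zi + h * vi for zi, vi in zip(z, vel)]
    return z


def decode(z):
    """Identity decoder: in this toy the domain latent is the image itself."""
    return list(z)


def translate(image, target_field):
    """Encode the source into the shared latent, then run the target bridge."""
    y = encode(image)
    return decode(euler_bridge(y, target_field))


# Each field is built from its own domain only (no cross-domain pairs).
bridge_a = make_bridge_field(10.0)
bridge_b = make_bridge_field(100.0)

source_a = [9.0, 10.0, 11.0]            # a domain-A "image" with mean 10
translated_b = translate(source_a, bridge_b)
print(translated_b)                      # [99.0, 100.0, 101.0]
```

In this toy the pixel-to-pixel differences (the "geometry") survive translation while the brightness (the "appearance") changes, which is the property the shared latent space is meant to provide.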

Paper Structure (34 sections, 3 theorems, 51 equations, 17 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Proposed Method
Shared Latent Space Assumption
Shared Latent Space Encoders
Latent Bridge Models as Conditional Decoders
Translation Error Analysis
Practical Design Choices
Experiments
Experimental Setup
Appearance-Invariant Encoder Validation
Unpaired Image Translation
Text-to-Image Editing
Conclusion and Limitations
...and 19 more sections

Key Result

Theorem 4.1

Consider target and source images that share a latent code under a learned vision encoder $E_\phi$. Assume the learned vector field $v_\theta$ and decoder $D_\varphi$ are Lipschitz-continuous, and that both the encoder and decoder introduce bounded reconstruction errors. If the bridge ODE is solved using an [...] where each term corresponds respectively to the encoder alignment, vector-field approximation, disc...
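
The discretization term named in the theorem can be made concrete with a small numerical sketch. It assumes a forward-Euler solver, which the truncated statement above does not confirm, and it uses a hand-picked Lipschitz vector field $v(z, t) = -z$ rather than a learned $v_\theta$; `euler_solve` is a hypothetical helper.

```python
import math


def euler_solve(v, z0, num_steps):
    """Forward-Euler integration of dz/dt = v(z, t) on t in [0, 1]."""
    z = z0
    h = 1.0 / num_steps
    for k in range(num_steps):
        z = z + h * v(z, k * h)
    return z


def v(z, t):
    """Toy vector field, 1-Lipschitz in z (not a learned model)."""
    return -z


exact = math.exp(-1.0)  # closed-form solution z(1) for z(0) = 1
errors = {n: abs(euler_solve(v, 1.0, n) - exact) for n in (10, 20, 40)}
for n, err in errors.items():
    print(n, err)
```

Doubling the number of steps roughly halves the error in this example, which is the first-order behavior expected of forward Euler on a Lipschitz field.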

Figures (17)

Figure 1: Overview of our Self-Supervised Semantic Bridge (SSB) framework for unpaired image translation and editing. SSB trains without paired data or adversarial objectives, relying on a shared latent-space assumption to connect domains via a common representation; $\times$ denotes no cross-domain supervision. The top rows showcase image-to-image translation results using text-free guidance, while the bottom rows demonstrate text-guided editing based on Stable Diffusion 3-Medium (Esser et al., 2024). This single-domain training enables unpaired translation across medical and natural images with improved structural consistency and OOD robustness (e.g., unseen MRI contrasts at test time).
Figure 2: Unlike inversion-based methods that invert toward unstructured Gaussian noise, SSB defines a unified semantic latent endpoint ${\bm{y}} = E_{\phi}({\bm{x}})$ using a self-supervised visual encoder and trains domain-specific bridges independently to connect each domain to this shared endpoint, enabling reliable translation by composing source-to-$T$ inversion with target-bridge generation.
Figure 3: Cross-modal semantic alignment and improvement in the shared latent space. Left: PCA visualizations of intermediate feature maps for paired CT/MRI inputs. Our ViT-B/8 encoder yields smoother and more anatomically consistent representations across modalities. Middle: Quantitative alignment on paired MRI--CT validation samples, comparing global alignment (cosine similarity) and local structural consistency (CKNNA). Right: Semantic improvement $\Delta_{\text{cosim}}$ measures the net gain in feature alignment with the ground-truth CT ($\text{gtCT}$) achieved by our generated translation ($\text{genCT}$) over the original input ($\text{MRI}$). Formally, $\Delta_{\text{cosim}} = \operatorname{cosim}(E_\phi(\text{genCT}), E_\phi(\text{gtCT})) - \operatorname{cosim}(E_\phi(\text{MRI}), E_\phi(\text{gtCT}))$. We plot this gain against the initial input alignment $x$. The dashed line $y = 1 - x$ represents the theoretical ceiling, corresponding to perfect recovery of the ground-truth features (similarity $= 1$). The improvements confirm that our method effectively minimizes semantic divergence, largely mitigating the impact of the initial domain gap.
Figure 4: Our SSB ensures anatomically consistent MRI$\rightarrow$CT translation across in-domain and out-of-domain scenarios. Segmentation masks are overlaid on MRI source images in OOD settings to provide structural reference, as paired CT ground truth is unavailable. They are not used during training or inference and serve only to illustrate anatomical fidelity without segmentation supervision.
Figure 5: Visual comparison with recent diffusion editing methods using the SD3-M base model. Our approach generates results that are visually consistent with the source image, preserving structural integrity while convincingly applying appearance changes across diverse scenarios.
...and 12 more figures
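
The $\Delta_{\text{cosim}}$ metric from the Figure 3 caption is simple enough to compute by hand. The sketch below is a minimal illustration: `cosim` and `delta_cosim` are hypothetical helper names, and the two-dimensional feature vectors are hand-picked stand-ins for encoder outputs, not values from the paper.

```python
import math


def cosim(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def delta_cosim(feat_gen_ct, feat_gt_ct, feat_mri):
    """Delta_cosim = cosim(genCT, gtCT) - cosim(MRI, gtCT)."""
    return cosim(feat_gen_ct, feat_gt_ct) - cosim(feat_mri, feat_gt_ct)


# Hypothetical features standing in for E_phi(gtCT), E_phi(MRI), E_phi(genCT).
feat_gt_ct = [1.0, 0.0]
feat_mri = [1.0, 1.0]      # 45 degrees away from the ground-truth features
feat_gen_ct = [1.0, 0.0]   # perfect recovery of the ground-truth features

x = cosim(feat_mri, feat_gt_ct)                         # initial input alignment
gain = delta_cosim(feat_gen_ct, feat_gt_ct, feat_mri)   # semantic improvement
ceiling = 1.0 - x                                       # dashed line y = 1 - x
print(x, gain, ceiling)
```

With perfect recovery the gain equals the ceiling $1 - x$; a translation that leaves the features unchanged from the MRI input yields a gain of zero.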

Theorems & Definitions (7)

Theorem 4.1
Remark A.5: Appearance invariance
proof
Theorem A.6: Grönwall Theorem
proof
Theorem A.7: Markov Inequality
proof
