Table of Contents
Fetching ...

Unsupervised Image-to-Image Translation Networks

Ming-Yu Liu, Thomas Breuel, Jan Kautz

TL;DR

The paper tackles unsupervised image-to-image translation by introducing UNIT, a framework that enforces a shared latent space across two domains via weight-sharing between encoders and generators paired with VAE-GANs. The approach yields bidirectional translation streams and cycle-consistent reconstructions through a combined loss of VAE, GAN, and cycle-consistency terms, learned with alternating optimization. Empirical results demonstrate strong performance across diverse tasks, including map-satellite translation, street scenes, animal and face translations, and unsupervised domain adaptation, with ablations confirming the benefit of the shared latent structure. The work situates itself relative to CoGAN and CycleGAN, arguing that the shared-latent-space constraint naturally induces cycle-consistency while enabling robust, unpaired cross-domain translation. Overall, UNIT offers a versatile, end-to-end framework for cross-domain image translation without paired data and highlights avenues for improved multimodality and stability.

Abstract

Unsupervised image-to-image translation aims at learning a joint distribution of images in different domains by using images from the marginal distributions in individual domains. Since there exists an infinite set of joint distributions that can arrive the given marginal distributions, one could infer nothing about the joint distribution from the marginal distributions without additional assumptions. To address the problem, we make a shared-latent space assumption and propose an unsupervised image-to-image translation framework based on Coupled GANs. We compare the proposed framework with competing approaches and present high quality image translation results on various challenging unsupervised image translation tasks, including street scene image translation, animal image translation, and face image translation. We also apply the proposed framework to domain adaptation and achieve state-of-the-art performance on benchmark datasets. Code and additional results are available in https://github.com/mingyuliutw/unit .

Unsupervised Image-to-Image Translation Networks

TL;DR

The paper tackles unsupervised image-to-image translation by introducing UNIT, a framework that enforces a shared latent space across two domains via weight-sharing between encoders and generators paired with VAE-GANs. The approach yields bidirectional translation streams and cycle-consistent reconstructions through a combined loss of VAE, GAN, and cycle-consistency terms, learned with alternating optimization. Empirical results demonstrate strong performance across diverse tasks, including map-satellite translation, street scenes, animal and face translations, and unsupervised domain adaptation, with ablations confirming the benefit of the shared latent structure. The work situates itself relative to CoGAN and CycleGAN, arguing that the shared-latent-space constraint naturally induces cycle-consistency while enabling robust, unpaired cross-domain translation. Overall, UNIT offers a versatile, end-to-end framework for cross-domain image translation without paired data and highlights avenues for improved multimodality and stability.

Abstract

Unsupervised image-to-image translation aims at learning a joint distribution of images in different domains by using images from the marginal distributions in individual domains. Since there exists an infinite set of joint distributions that can arrive the given marginal distributions, one could infer nothing about the joint distribution from the marginal distributions without additional assumptions. To address the problem, we make a shared-latent space assumption and propose an unsupervised image-to-image translation framework based on Coupled GANs. We compare the proposed framework with competing approaches and present high quality image translation results on various challenging unsupervised image translation tasks, including street scene image translation, animal image translation, and face image translation. We also apply the proposed framework to domain adaptation and achieve state-of-the-art performance on benchmark datasets. Code and additional results are available in https://github.com/mingyuliutw/unit .

Paper Structure

This paper contains 8 sections, 5 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: (a) The shared latent space assumption. We assume a pair of corresponding images $(x_1,x_2)$ in two different domains $\mathcal{X}_1$ and $\mathcal{X}_2$ can be mapped to a same latent code $z$ in a shared-latent space $\mathcal{Z}$. $E_1$ and $E_2$ are two encoding functions, mapping images to latent codes. $G_1$ and $G_2$ are two generation functions, mapping latent codes to images. (b) The proposed UNIT framework. We represent $E_1$$E_2$$G_1$ and $G_2$ using CNNs and implement the shared-latent space assumption using a weight sharing constraint where the connection weights of the last few layers (high-level layers) in $E_1$ and $E_2$ are tied (illustrated using dashed lines) and the connection weights of the first few layers (high-level layers) in $G_1$ and $G_2$ are tied. Here, $\tilde{x}_1^{1\rightarrow 1}$ and $\tilde{x}_2^{2\rightarrow 2}$ are self-reconstructed images, and $\tilde{x}_1^{1\rightarrow 2}$ and $\tilde{x}_2^{2\rightarrow 1}$ are domain-translated images. $D_1$ and $D_2$ are adversarial discriminators for the respective domains, in charge of evaluating whether the translated images are realistic.
  • Figure 2: (a) Illustration of the Map dataset. Left: satellite image. Right: map. We translate holdout satellite images to maps and measure the accuracy achieved by various configurations of the proposed framework. (b) Translation accuracy versus different network architectures. (c) Translation accuracy versus different hyper-parameter values. (d) Impact of weight-sharing and cycle-consistency constraints on translation accuracy.
  • Figure 3: Street scene image translation results. For each pair, left is input and right is the translated image.