Table of Contents
Fetching ...

Unsupervised Multiple Domain Translation through Controlled Disentanglement in Variational Autoencoder

Antonio Almudévar, Théo Mariotte, Alfonso Ortega, Marie Tahon

TL;DR

The paper tackles unsupervised translation across multiple domains without paired data by proposing a two-latent Variational Autoencoder. It separates domain information into a dedicated latent $z_l$ and encodes all other variability into a style latent $z_u$, with domain control achieved by conditioning the prior $p(z_l|c)$ and performing linear rotations in latent space to translate between domains via $ ilde{z}_l=T_t(T_c^T z_l)$. The priors are constructed using random rotations $T_c$ applied to a base mean $oldsymbol{}$ to yield $oldsymbol{}_c$, so $p(z_l|c)= ext{N}(oldsymbol{}_c,I)$ and $p(z_u)= ext{N}(0,I)$, while the encoder/decoder remain domain-agnostic and the objective includes two $eta$-scaled KL terms. Empirical results on MNIST, SVHN, and Cars3D demonstrate realistic translations and effective disentanglement, as shown by high classifier accuracy on $z_l$ and near-random performance on $z_u$, indicating a clean separation between domain and non-domain information. Overall, the method provides a GAN-free, interpretable, and scalable approach for unsupervised multi-domain translation with controllable latent space geometry.

Abstract

Unsupervised Multiple Domain Translation is the task of transforming data from one domain to other domains without having paired data to train the systems. Typically, methods based on Generative Adversarial Networks (GANs) are used to address this task. However, our proposal exclusively relies on a modified version of a Variational Autoencoder. This modification consists of the use of two latent variables disentangled in a controlled way by design. One of this latent variables is imposed to depend exclusively on the domain, while the other one must depend on the rest of the variability factors of the data. Additionally, the conditions imposed over the domain latent variable allow for better control and understanding of the latent space. We empirically demonstrate that our approach works on different vision datasets improving the performance of other well known methods. Finally, we prove that, indeed, one of the latent variables stores all the information related to the domain and the other one hardly contains any domain information.

Unsupervised Multiple Domain Translation through Controlled Disentanglement in Variational Autoencoder

TL;DR

The paper tackles unsupervised translation across multiple domains without paired data by proposing a two-latent Variational Autoencoder. It separates domain information into a dedicated latent and encodes all other variability into a style latent , with domain control achieved by conditioning the prior and performing linear rotations in latent space to translate between domains via . The priors are constructed using random rotations applied to a base mean to yield , so and , while the encoder/decoder remain domain-agnostic and the objective includes two -scaled KL terms. Empirical results on MNIST, SVHN, and Cars3D demonstrate realistic translations and effective disentanglement, as shown by high classifier accuracy on and near-random performance on , indicating a clean separation between domain and non-domain information. Overall, the method provides a GAN-free, interpretable, and scalable approach for unsupervised multi-domain translation with controllable latent space geometry.

Abstract

Unsupervised Multiple Domain Translation is the task of transforming data from one domain to other domains without having paired data to train the systems. Typically, methods based on Generative Adversarial Networks (GANs) are used to address this task. However, our proposal exclusively relies on a modified version of a Variational Autoencoder. This modification consists of the use of two latent variables disentangled in a controlled way by design. One of this latent variables is imposed to depend exclusively on the domain, while the other one must depend on the rest of the variability factors of the data. Additionally, the conditions imposed over the domain latent variable allow for better control and understanding of the latent space. We empirically demonstrate that our approach works on different vision datasets improving the performance of other well known methods. Finally, we prove that, indeed, one of the latent variables stores all the information related to the domain and the other one hardly contains any domain information.
Paper Structure (14 sections, 4 equations, 6 figures, 2 tables)

This paper contains 14 sections, 4 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Scheme of how the translation is performed in a two-dimensional latent space with three domains. In this case, the input belongs to domain 1 and is translated to domain 2 by rotating the latent variable the same angle that separates the regions of the prior distributions of these two domains.
  • Figure 2: Scheme of how an input image $x$ whose class is $c$ is translated to class $t$. The original input is first encoded obtaining $p(z_l)$ and $q_\phi(z_u)$. Then, the mean of $q_\phi(z_l)$ is multiplied by $T_c^T$ and $T_t$ to obtain the mean of $q_\phi(\tilde{z}_l)$. Finally, a sample from this last distribution and another from $q_\phi(z_u)$ are decoded to obtain the translated version of the input. In this figure the input class $c$ is 9 and the target class $t$ is 5.
  • Figure 3: Graphical model of the translation method. Dashed lines correspond to the encoding processes $q_\phi(z_l|x)$ and $q_\phi(z_u|x)$. Solid lines correspond to the decoding process in which the target class $t$ is used to modify the latent variable $z_l$ before getting $p_\theta(x|z_l, z_u)$.
  • Figure 4: Comparison between our method and StarGAN for MNIST
  • Figure 5: Results for SVHN
  • ...and 1 more figures