Unsupervised Multiple Domain Translation through Controlled Disentanglement in Variational Autoencoder
Antonio Almudévar, Théo Mariotte, Alfonso Ortega, Marie Tahon
TL;DR
The paper tackles unsupervised translation across multiple domains without paired data by proposing a two-latent Variational Autoencoder. It separates domain information into a dedicated latent $z_l$ and encodes all other variability into a style latent $z_u$, with domain control achieved by conditioning the prior $p(z_l|c)$ and performing linear rotations in latent space to translate between domains via $ ilde{z}_l=T_t(T_c^T z_l)$. The priors are constructed using random rotations $T_c$ applied to a base mean $oldsymbol{}$ to yield $oldsymbol{}_c$, so $p(z_l|c)= ext{N}(oldsymbol{}_c,I)$ and $p(z_u)= ext{N}(0,I)$, while the encoder/decoder remain domain-agnostic and the objective includes two $eta$-scaled KL terms. Empirical results on MNIST, SVHN, and Cars3D demonstrate realistic translations and effective disentanglement, as shown by high classifier accuracy on $z_l$ and near-random performance on $z_u$, indicating a clean separation between domain and non-domain information. Overall, the method provides a GAN-free, interpretable, and scalable approach for unsupervised multi-domain translation with controllable latent space geometry.
Abstract
Unsupervised Multiple Domain Translation is the task of transforming data from one domain to other domains without having paired data to train the systems. Typically, methods based on Generative Adversarial Networks (GANs) are used to address this task. However, our proposal exclusively relies on a modified version of a Variational Autoencoder. This modification consists of the use of two latent variables disentangled in a controlled way by design. One of this latent variables is imposed to depend exclusively on the domain, while the other one must depend on the rest of the variability factors of the data. Additionally, the conditions imposed over the domain latent variable allow for better control and understanding of the latent space. We empirically demonstrate that our approach works on different vision datasets improving the performance of other well known methods. Finally, we prove that, indeed, one of the latent variables stores all the information related to the domain and the other one hardly contains any domain information.
