CoDeGAN: Contrastive Disentanglement for Generative Adversarial Network

Jiangwei Zhao, Zejia Liu, Xiaohan Guo, Lili Pan

TL;DR

CoDeGAN addresses discrete-factor disentanglement in GANs by replacing image-domain similarity with a feature-domain contrastive loss and by incorporating self-supervised pre-training to learn semantic representations. It introduces a class-related encoder $E_c$, an intra-class encoder $E_z$, and losses $\mathcal{L}_c$ and $\mathcal{L}_{\boldsymbol{z}}$, optimized through alternating updates of $D$, $G$, and $E$, achieving improved stability and disentanglement without heavy reliance on mutual information terms. Empirical results across MNIST, Fashion-MNIST, CIFAR-10, COIL-20, and 3D datasets show state-of-the-art disentanglement metrics (ACC, NMI, ARI) with competitive image quality (IS, FID), and self-supervised pre-training yields further gains, especially on challenging CIFAR-10. The findings suggest a practical and scalable path to unsupervised discrete-factor disentanglement in GANs, with robust performance and potential for extending to multi-factor disentanglement in future work.

Abstract

Disentanglement, a critical concern in interpretable machine learning, has also garnered significant attention from the computer vision community. Many existing GAN-based class disentanglement (unsupervised) approaches, such as InfoGAN and its variants, primarily aim to maximize the mutual information (MI) between the generated image and its latent codes. However, this focus may lead to a tendency for the network to generate highly similar images when presented with the same latent class factor, potentially resulting in mode collapse or mode dropping. To alleviate this problem, we propose CoDeGAN (Contrastive Disentanglement for Generative Adversarial Networks), where we relax similarity constraints for disentanglement from the image domain to the feature domain. This modification not only enhances the stability of GAN training but also improves their disentangling capabilities. Moreover, we integrate self-supervised pre-training into CoDeGAN to learn semantic representations, significantly facilitating unsupervised disentanglement. Extensive experimental results demonstrate the superiority of our method over state-of-the-art approaches across multiple benchmarks. The code is available at https://github.com/learninginvision/CoDeGAN.

Paper Structure

This paper contains 27 sections, 2 theorems, 13 equations, 13 figures, 3 tables.

Key Result

Lemma 1

Minimizing the contrastive loss $\mathcal{L}_c$ is equivalent to maximizing the mutual information between generative representations $\mathbf{f}$ and $\mathbf{f}^+$, i.e., $E_c\left(G\left(\mathbf{z},c\right)\right)$ and $E_c\left(G\left(\mathbf{z}^+,c^+\right)\right)$.
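
To make the feature-domain idea concrete, the following is a minimal NumPy sketch of a generic InfoNCE-style contrastive loss over a batch of class-related features. It is an illustration only: the function name `contrastive_loss`, the `temperature` parameter, and the exact normalization are assumptions and may differ from the $\mathcal{L}_c$ defined in the paper. Features sharing the same sampled class code $c$ are treated as positive pairs, all others as negatives.

```python
import numpy as np

def contrastive_loss(features, labels, temperature=0.5):
    """Generic InfoNCE-style loss (hypothetical stand-in for L_c).

    features: (n, d) array standing in for f = E_c(G(z, c)).
    labels:   (n,) integer array of the sampled class codes c.
    """
    # Cosine similarity between all pairs of L2-normalized features.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T / temperature

    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    # Exclude each sample's similarity with itself.
    sim = np.where(self_mask, -np.inf, sim)

    # Numerically stable log-softmax over the other samples in the batch.
    m = sim.max(axis=1, keepdims=True)
    log_prob = sim - (m + np.log(np.exp(sim - m).sum(axis=1, keepdims=True)))

    # Positives: other samples generated with the same class code.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0
    if not valid.any():
        return 0.0  # no positive pairs in this batch

    # Average log-probability of the positives for each anchor.
    sum_pos = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(sum_pos[valid] / pos_counts[valid]).mean())
```

Under this sketch, a batch whose features cluster by class code yields a lower loss than the same features paired with mismatched codes, which is the behavior a disentangling encoder is pushed toward.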

Figures (13)

  • Figure 1: Contrastive disentanglement framework. The input of generator $G$ consists of two parts: (i) $c\sim \mathrm{Mul}\left(\boldsymbol{\pi}\right)$, which controls the class, and (ii) $\mathbf{z}\sim\mathcal{N}\left(\mathbf{0}, \sigma^2\mathbf{I}\right)$, which corresponds to intra-class variation. The representation $\mathbf{f}$ encoded by $E_c$ is regularized by the contrastive loss $\mathcal{L}_c$ for disentanglement, while the representation $\hat{\mathbf{z}}$ encoded by $E_{\mathbf{z}}$ is regularized by the reconstruction loss $\mathcal{L}_{\mathbf{z}}$ to preserve intra-class variation. $E_c$ and $E_{\mathbf{z}}$ may partially share weights. In $\mathcal{L}_c$, a positive pair consists of images from the same class, and a negative pair consists of images from different classes.
  • Figure 2: Comparison of several extensions of the original CoDeGAN. (a): CoDeGAN. (b): CoDeGAN (self) or CoDeGAN (meta). $E_c$ is pretrained by contrastive learning or meta-learning.
  • Figure 3: Disentanglement accuracy and generative quality of CoDeGAN and InfoGAN under different trade-offs on CIFAR-10. (a) The x-axis denotes ACC ($\%$) and the y-axis denotes FID. Orange line: CoDeGAN with loss $\mathcal{L}_{GAN} + \beta_1 \mathcal{L}_c$. Black line: InfoGAN with loss $\mathcal{L}_{GAN}+\beta_1\mathcal{L}_{MI}$. CoDeGAN achieves higher ACC and lower FID than InfoGAN for most of the trade-offs. (b) The worst generated images of CoDeGAN under different trade-offs. (c) The worst generated images of InfoGAN under different trade-offs. From left to right, the generated images correspond to the best (green circle), a much larger (blue triangle), and the largest (yellow cross) $\beta_1$. Red boxes in (b) and (c) indicate mode collapse for certain classes.
  • Figure 4: Visualization of the encoded features of images generated by CoDeGAN variants. For all settings, $10000$ points are sampled from $p\left(\mathbf{z},c\right)$, the number of sampled points is the same for each fixed $c$, and different colors correspond to different values of factor $c$. (a): CoDeGAN. (b): CoDeGAN with pre-trained $E_c$. The encoder $E_c$ is pre-trained using SimCLR.
  • Figure 5: Disentanglement accuracy of CoDeGAN with and without pre-training. Pre-training significantly improves the disentanglement accuracy on Fashion-MNIST, 3D-Chairs, 3D-Cars and COIL-20.
  • ...and 8 more figures
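
The generator input described in the Figure 1 caption can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `sample_generator_input`, the uniform default for $\boldsymbol{\pi}$, and the one-hot encoding of $c$ are assumptions made for the example.

```python
import numpy as np

def sample_generator_input(batch_size, num_classes, z_dim,
                           sigma=1.0, pi=None, rng=None):
    """Sample (z, c) as in the Figure 1 caption (hypothetical helper).

    c ~ Mul(pi)            : discrete class factor
    z ~ N(0, sigma^2 * I)  : intra-class variation
    """
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: uniform class prior when pi is not supplied.
    if pi is None:
        pi = np.full(num_classes, 1.0 / num_classes)
    else:
        pi = np.asarray(pi, dtype=float)

    c = rng.choice(num_classes, size=batch_size, p=pi)
    z = rng.normal(0.0, sigma, size=(batch_size, z_dim))
    # Assumption: the class factor is fed to G as a one-hot vector.
    c_onehot = np.eye(num_classes)[c]
    return z, c, c_onehot
```

A call such as `sample_generator_input(64, 10, 62)` would return a batch of latent vectors and class codes whose concatenation could serve as generator input; the specific sizes here are placeholders, not values taken from the paper.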

Theorems & Definitions (3)

  • Lemma 1
  • Lemma 2
  • proof