HoloGAN: Unsupervised learning of 3D representations from natural images

Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, Christian Richardt, Yong-Liang Yang

TL;DR

HoloGAN tackles unsupervised learning of 3D representations from natural 2D images by injecting a strong 3D inductive bias into a GAN, using a learned 3D feature volume, rigid-body pose transformations, and a differentiable projection unit. This design enables explicit pose control and disentanglement of pose, shape, and appearance without pose or 3D supervision, while maintaining competitive image quality. Through qualitative and quantitative experiments across diverse datasets, it demonstrates view manipulation capabilities, deeper 3D representations than voxel grids, and robust disentanglement, highlighting the potential of explicit 3D representations in generative modeling. The work also provides ablation analyses showing the critical role of random 3D transformations and 3D-structure-based latent organization for successful disentanglement and high-fidelity rendering.

Abstract

We propose a novel generative adversarial network (GAN) for the task of unsupervised learning of 3D representations from natural images. Most generative models rely on 2D kernels to generate images and make few assumptions about the 3D world. These models therefore tend to create blurry images or artefacts in tasks that require a strong 3D understanding, such as novel-view synthesis. HoloGAN instead learns a 3D representation of the world and learns to render this representation in a realistic manner. Unlike other GANs, HoloGAN provides explicit control over the pose of generated objects through rigid-body transformations of the learnt 3D features. Our experiments show that using explicit 3D features enables HoloGAN to disentangle 3D pose and identity, which is further decomposed into shape and appearance, while still being able to generate images with similar or higher visual quality than other generative models. HoloGAN can be trained end-to-end from unlabelled 2D images only. In particular, we do not require pose labels, 3D shapes, or multiple views of the same objects. This shows that HoloGAN is the first generative model that learns 3D representations from natural images in an entirely unsupervised manner.
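The pose control described in the abstract rests on rigid-body transformations. As a minimal sketch of the rotation part only, the snippet below builds a rotation matrix from an azimuth and an elevation angle and applies it to a single 3D coordinate. The axis conventions (azimuth about the y-axis, elevation about the x-axis, azimuth applied first) and the names `rotation_matrix` and `rotate_point` are assumptions made for illustration and are not taken from the paper. Applying such a transformation to a whole feature volume would additionally need a resampling step, which is omitted here.

```python
import math


def rotation_matrix(azimuth_deg, elevation_deg):
    """Return a 3x3 rotation: azimuth about y, then elevation about x (assumed convention)."""
    a = math.radians(azimuth_deg)
    e = math.radians(elevation_deg)
    rot_y = [
        [math.cos(a), 0.0, math.sin(a)],
        [0.0, 1.0, 0.0],
        [-math.sin(a), 0.0, math.cos(a)],
    ]
    rot_x = [
        [1.0, 0.0, 0.0],
        [0.0, math.cos(e), -math.sin(e)],
        [0.0, math.sin(e), math.cos(e)],
    ]
    # Matrix product rot_x @ rot_y, so the azimuth rotation acts on the point first.
    return [
        [sum(rot_x[i][k] * rot_y[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def rotate_point(matrix, point):
    """Apply a 3x3 matrix to a 3D point given as a list of three floats."""
    return [sum(matrix[i][j] * point[j] for j in range(3)) for i in range(3)]


# Hypothetical pose: 90 degrees azimuth, 0 degrees elevation.
pose = rotation_matrix(90.0, 0.0)
rotated = rotate_point(pose, [1.0, 0.0, 0.0])
```

Because a rotation is length-preserving, the rotated coordinate keeps the norm of the input, which is a quick sanity check for any alternative axis convention.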

Paper Structure

This paper contains 27 sections, 4 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: HoloGAN learns to separate pose from identity (shape and appearance) only from unlabelled 2D images without sacrificing the visual fidelity of the generated images. All results shown here are sampled from HoloGAN for the same identities in each row but in different poses.
  • Figure 2: Comparison of generative image models. Data given to the discriminator are coloured purple. Left: In conditional GANs, the pose is observed and the discriminator is given access to this information. Right: HoloGAN does not require pose labels during training and the discriminator is not given access to pose information.
  • Figure 3: HoloGAN's generator network: we employ 3D convolutions, a 3D rigid-body transformation, the projection unit and 2D convolutions. We also remove the traditional input layer from $\mathbf{z}$, and start from a learnt constant 4D tensor. The latent vector $\mathbf{z}$ is instead fed through multilayer perceptrons (MLPs) to map to the affine transformation parameters for adaptive instance normalisation (AdaIN). Inputs are coloured gray.
  • Figure 4: For the Chairs dataset with high intra-class variation, HoloGAN can still disentangle pose (360° azimuth, 160° elevation) and identity.
  • Figure 5: We compare HoloGAN to InfoGAN (images adapted from Chen et al., 2016) on CelebA (64$\times$64) in the task of separating identity and azimuth. Note that we cannot control what can be learnt by InfoGAN.
  • ...and 4 more figures
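
The caption of Figure 3 states that the latent vector $\mathbf{z}$ is mapped to affine parameters for adaptive instance normalisation (AdaIN). The sketch below illustrates the commonly used AdaIN formulation, in which each feature channel is normalised to zero mean and unit variance and then scaled and shifted. It is an illustration under assumptions, not the authors' implementation: the function name `adain`, the flattened per-channel layout, the `eps` value, and the example scale and shift values are hypothetical, and the MLPs that would produce the scales and shifts from $\mathbf{z}$ are left out. The standard library is used to keep the example dependency-free.

```python
import math
from statistics import fmean, pvariance


def adain(channels, scales, shifts, eps=1e-5):
    """Normalise each channel, then apply a per-channel scale and shift.

    channels: list of channels, each a flat list of activations for one sample.
    scales, shifts: one float per channel (in HoloGAN these would come from
    MLPs applied to z; here they are passed in directly as an assumption).
    """
    out = []
    for values, scale, shift in zip(channels, scales, shifts):
        mean = fmean(values)
        std = math.sqrt(pvariance(values, mu=mean) + eps)
        out.append([scale * (v - mean) / std + shift for v in values])
    return out


# Two hypothetical feature channels with four activations each.
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 14.0, 14.0]]
styled = adain(features, scales=[2.0, 0.5], shifts=[1.0, -3.0])
```

After the call, each channel's mean equals its shift and its standard deviation is close to its scale, so the style parameters fully determine the first- and second-order statistics of the features.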