Unsupervised Generative 3D Shape Learning from Natural Images
Attila Szabó, Givi Meishvili, Paolo Favaro
TL;DR
This work tackles unsupervised learning of explicit 3D shapes directly from natural images by integrating a GAN with a differentiable renderer. A generator $G$ maps a latent code $z$ to a 3D scene, which is rendered from a random viewpoint $v$ as $x_f = R(G(z), v)$ and trained to match real images via a discriminator $D$. The method outputs a scene representation $m=[\mathbf{s}, \mathbf{t}, \mathbf{b}]$ (shape, texture, background) and uses the differentiable renderer $R$ to enforce multi-view realism, with the training objective $\min_G \max_D \mathbb{E}_{x_r \sim p_r}[ \log D(x_r) ] + \mathbb{E}_{z \sim \mathcal{N}(0,I), v \sim p_v}[ \log(1 - D(R(G(z), v))) ]$, where $x_r$ is a real image drawn from the data distribution $p_r$ and $p_v$ is the viewpoint distribution. Core contributions include a differentiable renderer with exact gradients at object boundaries, an analysis of the ambiguities and priors necessary for learning, and a Shape-Texture-Background decomposition compatible with convolutional networks via a StyleGAN-based generator. The approach is demonstrated on FFHQ faces, showing qualitative recovery of plausible 3D shapes from natural images. This work advances unsupervised 3D understanding from real data and lays groundwork for scalable 3D generative modeling from uncategorized images.
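As a rough illustration of the objective above, the following sketch estimates the value $\mathbb{E}[\log D(x_r)] + \mathbb{E}[\log(1 - D(x_f))]$ from batches of discriminator outputs. The function name `gan_value` and the clipping constant `eps` are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Monte Carlo estimate of E[log D(x_r)] + E[log(1 - D(x_f))].

    d_real: discriminator probabilities on real images, shape (N,)
    d_fake: discriminator probabilities on rendered images R(G(z), v), shape (M,)
    eps:    hypothetical clipping constant that keeps the logarithms finite
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# A discriminator that cannot tell real from rendered outputs 0.5 everywhere,
# which gives the value 2 * log(0.5).
undecided = gan_value([0.5, 0.5], [0.5, 0.5])
```

The discriminator is trained to increase this value and the generator to decrease it; only the second term depends on the generator, through the rendered images.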
Abstract
In this paper we present, to the best of our knowledge, the first method to learn a generative model of 3D shapes from natural images in a fully unsupervised way. For example, we do not use any ground truth 3D or 2D annotations, stereo video, or ego-motion during training. Our approach follows the general strategy of Generative Adversarial Networks, where an image generator network learns to create image samples that are realistic enough to fool a discriminator network into believing that they are natural images. In contrast to a standard GAN, in our approach the image generation is split into two stages. In the first stage, a generator network outputs 3D objects. In the second, a differentiable renderer produces an image of the 3D objects from random viewpoints. The key observation is that a realistic 3D object should yield a realistic rendering from any plausible viewpoint. Thus, by randomizing the choice of the viewpoint, our proposed training forces the generator network to learn an interpretable 3D representation disentangled from the viewpoint. In this work, a 3D representation consists of a triangle mesh and a texture map that is used to color the triangle surface via UV mapping. We provide an analysis of our learning approach, expose its ambiguities, and show how to overcome them. Experimentally, we demonstrate that our method can learn realistic 3D shapes of faces by using only the natural images of the FFHQ dataset.
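The two-stage generation described above can be sketched with a toy example. This is only an illustration under strong assumptions: the linear `toy_generator`, the octahedron template, the uniform yaw range, and the point-splatting `toy_render` are all hypothetical stand-ins, and the renderer here is neither differentiable nor mesh-based, unlike the one in the paper.

```python
import numpy as np

def toy_generator(z, template, weights):
    """Stage 1 (hypothetical): map a latent code z to 3D vertices
    as a fixed template plus linear offsets."""
    offsets = (weights @ z).reshape(template.shape)
    return template + offsets

def rotation_y(angle):
    """Rotation matrix about the vertical axis, used as the viewpoint."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def toy_render(vertices, angle, size=16):
    """Stage 2 (stand-in): rotate by the viewpoint, project orthographically,
    and mark each projected vertex in a binary image."""
    rotated = vertices @ rotation_y(angle).T
    xy = rotated[:, :2]
    # Map coordinates in [-1.5, 1.5] to pixel indices in [0, size - 1].
    pix = np.floor((xy + 1.5) / 3.0 * size).astype(int)
    pix = np.clip(pix, 0, size - 1)
    image = np.zeros((size, size))
    image[pix[:, 1], pix[:, 0]] = 1.0
    return image

rng = np.random.default_rng(0)
# Octahedron vertices serve as the assumed template shape.
template = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                     [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)
latent_dim = 4
weights = 0.05 * rng.standard_normal((template.size, latent_dim))

z = rng.standard_normal(latent_dim)              # z ~ N(0, I)
vertices = toy_generator(z, template, weights)   # stage 1: 3D object
angle = rng.uniform(-np.pi / 4, np.pi / 4)       # assumed viewpoint distribution
image = toy_render(vertices, angle)              # stage 2: image from a random view
```

Because the viewpoint is sampled independently of `z`, the same `vertices` must look plausible under every sampled `angle`, which is the constraint that pushes the 3D representation to be disentangled from the viewpoint.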
