Unsupervised Generative 3D Shape Learning from Natural Images

Attila Szabó, Givi Meishvili, Paolo Favaro

TL;DR

This work tackles unsupervised learning of explicit 3D shapes directly from natural images by combining a GAN with a differentiable renderer. A generator $G$ maps a latent vector $\mathbf{z}$ to a scene representation $\mathbf{m} = G(\mathbf{z}) = [\mathbf{s}, \mathbf{t}, \mathbf{b}]$ (shape, texture, background), and a differentiable renderer $R$ renders it from a random viewpoint $\mathbf{v}$ to produce a fake image $\mathbf{x}_f = R(G(\mathbf{z}), \mathbf{v})$. A discriminator $D$ compares these renderings against real images $\mathbf{x}_r$, giving the training objective $\min_G \max_D \mathbb{E}_{\mathbf{x}_r \sim p_r}[ \log D(\mathbf{x}_r) ] + \mathbb{E}_{\mathbf{z} \sim \mathcal{N}(0,I), \mathbf{v} \sim p_v}[ \log(1 - D(R(G(\mathbf{z}), \mathbf{v}))) ]$, where $p_r$ is the distribution of real images and $p_v$ is the distribution of viewpoints. Because the renderings must look realistic for every sampled viewpoint, the objective enforces multi-view realism. Core contributions include a differentiable renderer with exact gradients at object boundaries, an analysis of the ambiguities and priors necessary for learning, and a Shape-Texture-Background decomposition compatible with convolutional networks via a StyleGAN-based generator. The approach is demonstrated on FFHQ faces, showing qualitative recovery of plausible 3D shapes from natural images. This work advances unsupervised 3D understanding from real data and lays groundwork for scalable 3D generative modeling from uncategorized images.
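
To make the objective concrete, the following is a minimal sketch of a Monte Carlo estimate of the value function $\mathbb{E}[\log D(\mathbf{x}_r)] + \mathbb{E}[\log(1 - D(R(G(\mathbf{z}), \mathbf{v})))]$ on toy data. Everything here is an assumption made for illustration: `generator`, `render`, and `discriminator` are hypothetical stand-ins (a linear map, a cosine scaling, and a logistic unit), not the StyleGAN-based generator or the renderer of the paper, and the uniform viewpoint distribution is likewise assumed.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def generator(z, W):
    """Toy stand-in for G: latent z -> scene vector m (plays the role of [s, t, b])."""
    return np.tanh(z @ W)

def render(m, v):
    """Toy stand-in for R (no trainable parameters): output depends on scene m and viewpoint v."""
    return m * np.cos(v)[:, None]

def discriminator(x, w):
    """Toy stand-in for D: probability that x is a real image."""
    return sigmoid(x @ w)

def gan_value(x_real, z, v, W, w, eps=1e-8):
    """Monte Carlo estimate of E[log D(x_r)] + E[log(1 - D(R(G(z), v)))]."""
    x_fake = render(generator(z, W), v)
    real_term = np.mean(np.log(discriminator(x_real, w) + eps))
    fake_term = np.mean(np.log(1.0 - discriminator(x_fake, w) + eps))
    return real_term + fake_term

rng = np.random.default_rng(0)
n, latent_dim, image_dim = 64, 8, 16
W = rng.normal(size=(latent_dim, image_dim))   # generator parameters (toy)
w = rng.normal(size=image_dim)                 # discriminator parameters (toy)
x_real = rng.normal(size=(n, image_dim))       # placeholder for real images x_r ~ p_r
z = rng.normal(size=(n, latent_dim))           # z ~ N(0, I)
v = rng.uniform(-np.pi / 2, np.pi / 2, n)      # v ~ p_v (assumed uniform here)

value = gan_value(x_real, z, v, W, w)
print(value)
```

In training, the discriminator parameters would be updated to increase this value and the generator parameters to decrease it; the renderer itself only passes gradients through.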

Abstract

In this paper we present, to the best of our knowledge, the first method to learn a generative model of 3D shapes from natural images in a fully unsupervised way. For example, we do not use any ground truth 3D or 2D annotations, stereo video, or ego-motion during training. Our approach follows the general strategy of Generative Adversarial Networks, where an image generator network learns to create image samples that are realistic enough to fool a discriminator network into believing that they are natural images. In contrast to a standard GAN, our approach splits the image generation into 2 stages. In the first stage, a generator network outputs 3D objects. In the second, a differentiable renderer produces an image of the 3D objects from random viewpoints. The key observation is that a realistic 3D object should yield a realistic rendering from any plausible viewpoint. Thus, by randomizing the choice of the viewpoint, our proposed training forces the generator network to learn an interpretable 3D representation disentangled from the viewpoint. In this work, a 3D representation consists of a triangle mesh and a texture map that is used to color the triangle surface via UV mapping. We provide an analysis of our learning approach, expose its ambiguities, and show how to overcome them. Experimentally, we demonstrate that our method can learn realistic 3D shapes of faces by using only the natural images of the FFHQ dataset.
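
The two-stage generation described in the abstract can be illustrated with a small sketch, under strong simplifying assumptions. The hypothetical `toy_generator` outputs only 3D vertex positions (no triangles, texture, or background), and the hypothetical `toy_render` merely rotates the vertices about the vertical axis and projects them orthographically, so it is not the paper's differentiable mesh renderer. The uniform viewpoint range is also an assumption. The point is only that a single 3D object, generated once, is viewed from several random viewpoints.

```python
import numpy as np

def toy_generator(z, base_vertices, basis):
    """Stage 1 (stand-in for G): latent z -> 3D vertex positions of shape (V, 3)."""
    offsets = (basis @ z).reshape(base_vertices.shape)
    return base_vertices + 0.1 * offsets

def yaw_rotation(angle):
    """Rotation matrix about the vertical (y) axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def toy_render(vertices, angle):
    """Stage 2 (stand-in for R): rotate by the viewpoint, then project orthographically to 2D."""
    rotated = vertices @ yaw_rotation(angle).T
    return rotated[:, :2]

rng = np.random.default_rng(0)
num_vertices, latent_dim = 6, 4
base_vertices = rng.normal(size=(num_vertices, 3))        # toy template shape
basis = rng.normal(size=(num_vertices * 3, latent_dim))   # toy deformation basis

z = rng.normal(size=latent_dim)                           # z ~ N(0, I)
vertices = toy_generator(z, base_vertices, basis)         # one 3D object
angles = rng.uniform(-np.pi / 2, np.pi / 2, size=3)       # random viewpoints (assumed range)
views = [toy_render(vertices, a) for a in angles]         # one 2D view per viewpoint

for a, view in zip(angles, views):
    print(f"yaw = {np.degrees(a):.1f} deg, projected shape = {view.shape}")
```

Because the viewpoint is sampled independently of `z`, the generator output cannot encode the viewpoint itself; in the adversarial setting this is what pushes the 3D representation to be disentangled from the viewpoint.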

Paper Structure

This paper contains 10 sections, 1 theorem, 8 equations, 6 figures, 1 table.

Key Result

Theorem 1

When the above assumptions are satisfied, the generated scene representation distribution is identical to the real one, thus $G(\mathbf{z}) \sim p_{\mathbf{m}}$, with $\mathbf{z} \sim \mathcal{N}(0,I)$.

Figures (6)

  • Figure 1: Samples from our generator trained on the FFHQ dataset at $128 \times 128$ resolution. The first column shows random rendered samples. The other columns show the 3D normal map, texture, background and textured 3D shapes for 5 canonical viewpoints in the range of $\pm 90$ degrees.
  • Figure 2: Illustration of the training setup. $G$ and $D$ are the generator and discriminator neural networks. $R$ is the differentiable renderer and it has no trainable parameters. The random variables ${\mathbf{z}}$, ${\mathbf{m}}$ and ${\mathbf{v}}$ are the latent vector, 3D object and the viewpoint parameters. The fake images are ${\mathbf{x}}_f$ and the real images are ${\mathbf{x}}_r$.
  • Figure 3: a) illustration of the soft renderer; b) results with crisp renderer; c) results with soft renderer; d) results with soft renderer + size constraint; e) results with soft renderer + size constraint + pyramid on viewpoint range of $\pm120$ degrees.
  • Figure 4: Samples from our generator trained on the FFHQ dataset at $128 \times 128$ resolution. The first column shows random rendered samples. The other columns show the 3D normal map, texture, background and textured 3D shapes for 5 canonical viewpoints in the range of $\pm 90$ degrees.
  • Figure 5: Interpolated 3D faces. From top to bottom, the 3D models are generated by linearly interpolating the latent vector fed to the generator. We show 3 viewpoints in the range of $\pm 45$ degrees.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Theorem 1