Deep Convolutional Inverse Graphics Network

Tejas D. Kulkarni, Will Whitney, Pushmeet Kohli, Joshua B. Tenenbaum

TL;DR

DC-IGN tackles learning interpretable, disentangled representations that separate pose, lighting, and intrinsic properties to enable controllable image re-rendering. It uses a convolutional encoder–decoder trained with Stochastic Gradient Variational Bayes, plus a targeted training procedure that assigns specific latent groups to distinct transformations. The approach yields a functioning 3D rendering engine capable of novel-view synthesis on 3D faces and chairs, outperforming entangled baselines in representing transformations. This work advances inverse graphics by providing an end-to-end, data-driven method for automatic disentanglement without explicit supervision.

Abstract

This paper presents the Deep Convolution Inverse Graphics Network (DC-IGN), a model that learns an interpretable representation of images. This representation is disentangled with respect to transformations such as out-of-plane rotations and lighting variations. The DC-IGN model is composed of multiple layers of convolution and de-convolution operators and is trained using the Stochastic Gradient Variational Bayes (SGVB) algorithm. We propose a training procedure to encourage neurons in the graphics code layer to represent a specific transformation (e.g. pose or light). Given a single input image, our model can generate new images of the same object with variations in pose and lighting. We present qualitative and quantitative results of the model's efficacy at learning a 3D rendering engine.

Paper Structure

This paper contains 10 sections, 1 equation, 7 figures.

Figures (7)

  • Figure 1: Model Architecture: Deep Convolutional Inverse Graphics Network (DC-IGN) has an encoder and a decoder. We follow the variational autoencoder architecture (Kingma and Welling, 2013) with variations. The encoder consists of several layers of convolutions followed by max-pooling, and the decoder has several layers of unpooling (upsampling using nearest neighbors) followed by convolution. (a) During training, data $x$ is passed through the encoder to produce the posterior approximation $Q(z_i|x)$, where $z_i$ consists of scene latent variables such as pose, light, texture or shape. To learn the parameters of DC-IGN, gradients are back-propagated with stochastic gradient descent on the following variational objective function: $-\log P(x|z_i) + KL(Q(z_i|x) \| P(z_i))$ for every $z_i$. We can force DC-IGN to learn a disentangled representation by showing mini-batches with a set of inactive and active transformations (e.g. face rotating, light sweeping in some direction, etc.). (b) At test time, data $x$ can be passed through the encoder to get latents $z_i$. Images can be re-rendered to different viewpoints, lighting conditions, shape variations, etc. by just manipulating the appropriate graphics code group $(z_i)$, which is how one would manipulate an off-the-shelf 3D graphics engine.
  • Figure 2: Structure of the representation vector. $\phi$ is the azimuth of the face, $\alpha$ is the elevation of the face with respect to the camera, and $\phi_L$ is the azimuth of the light source.
  • Figure 3: Training on a minibatch in which only $\phi$, the azimuth angle of the face, changes. During the forward step, the output from each component $z_i \neq z_1$ of the encoder is altered to be the same for each sample in the batch. This reflects the fact that the generating variables of the image (e.g. the identity of the face) which correspond to the desired values of these latents are unchanged throughout the batch. By holding these outputs constant throughout the batch, the single neuron $z_1$ is forced to explain all the variance within the batch, i.e. the full range of changes to the image caused by changing $\phi$. During the backward step $z_1$ is the only neuron which receives a gradient signal from the attempted reconstruction, and all $z_i \neq z_1$ receive a signal which nudges them to be closer to their respective averages over the batch. During the complete training process, after this batch, another batch is selected at random; it likewise contains variations of only one of $\{\phi, \alpha, \phi_L, \text{intrinsic}\}$; all neurons which do not correspond to the selected latent are clamped; and the training proceeds.
  • Figure 4: Manipulating pose variables: Qualitative results showing the generalization capability of the learned DC-IGN decoder to rerender a single input image with different pose directions. (a) We change the latent $z_{elevation}$ smoothly from -15 to 15, leaving all 199 other latents unchanged. (b) We change the latent $z_{azimuth}$ smoothly from -15 to 15, leaving all 199 other latents unchanged.
  • Figure 5: (a) Manipulating light variables: Qualitative results showing the generalization capability of the learned DC-IGN decoder to re-render the original static image with different light directions. The latent neuron $z_{light}$ is changed to random values but all other latents are clamped. (b) Entangled versus disentangled representations. Top: Original reconstruction (left) and transformed (right) using a normally-trained network. Bottom: The same transformation using the DC-IGN.
  • ...and 2 more figures
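
The clamping procedure in the Figure 3 caption and the KL term of the Figure 1 objective can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`clamp_inactive`, `backward_signal`, `gaussian_kl`) are hypothetical, the latents are plain Python lists rather than network activations, the inactive-latent signal is taken to be the unscaled difference from the batch mean, and the KL term assumes a diagonal Gaussian posterior with a standard normal prior, which the captions do not state explicitly.

```python
import math


def clamp_inactive(latents, active):
    """Forward step: replace every inactive latent with its batch mean.

    latents: one latent vector per sample in the mini-batch (list of lists).
    active:  index of the single latent allowed to vary in this batch.
    Returns the clamped batch and the per-latent batch means.
    """
    n, dim = len(latents), len(latents[0])
    means = [sum(row[j] for row in latents) / n for j in range(dim)]
    clamped = [
        [row[j] if j == active else means[j] for j in range(dim)]
        for row in latents
    ]
    return clamped, means


def backward_signal(latents, means, recon_grad, active):
    """Backward step: only the active latent keeps the reconstruction gradient.

    Each inactive latent instead receives (value - batch mean), so a
    gradient-descent step moves it toward its average over the batch.
    Using the raw, unscaled difference is an assumption of this sketch.
    """
    dim = len(means)
    return [
        [
            recon_grad[i][j] if j == active else latents[i][j] - means[j]
            for j in range(dim)
        ]
        for i in range(len(latents))
    ]


def gaussian_kl(mu, log_var):
    """KL(Q(z|x) || P(z)) for a diagonal Gaussian Q and standard normal P."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


if __name__ == "__main__":
    # Two samples, three latents; only latent 0 (e.g. azimuth) is active.
    batch = [[0.2, 1.0, -1.0], [0.8, 3.0, 1.0]]
    clamped, means = clamp_inactive(batch, active=0)
    grads = [[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]]  # stand-in decoder gradients
    print(clamped)
    print(backward_signal(batch, means, grads, active=0))
    print(gaussian_kl([0.0, 0.0], [0.0, 0.0]))
```

In this sketch the decoder would see `clamped` rather than the raw encoder output, so every difference between reconstructions in the batch has to be carried by the active latent.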