GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields

Michael Niemeyer, Andreas Geiger

TL;DR

GIRAFFE tackles the challenge of controllable 3D-aware image synthesis by introducing compositional neural feature fields for individual objects and a scene-level additive composition. It renders scenes via a two-stage pipeline: volume rendering to low-resolution feature maps, followed by a fast 2D neural renderer to produce high-resolution RGB images, trained on unposed image collections with a GAN objective. The method enables disentanglement of objects from the background and supports test-time editing of object pose, shape, and appearance, as well as adding objects and varying camera viewpoints, all while maintaining efficiency and scalability to real-world data. Empirically, it achieves competitive FID scores and superior controllability and generalization compared to strong 3D-aware baselines, with notable speedups in rendering. This approach advances practical 3D-aware content generation by combining explicit scene compositionality with neural rendering and unsupervised training.
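The composition and volume-rendering steps summarized above can be sketched for a single camera ray. This is a minimal illustration, not the authors' implementation: it assumes that "additive composition" means summing the per-entity densities and taking the density-weighted mean of their features, and that rendering uses standard alpha compositing along the ray. The function names `compose` and `render_ray` and all array shapes are hypothetical.

```python
import numpy as np

def compose(sigmas, feats):
    """Combine N per-entity fields evaluated at the same S sample points.

    sigmas: (N, S) non-negative densities, feats: (N, S, F) features.
    Assumption: densities are summed and features are averaged with
    density weights; the summary text only says "additive composition".
    """
    sigma = sigmas.sum(axis=0)                             # (S,)
    weighted = (sigmas[..., None] * feats).sum(axis=0)     # (S, F)
    feat = weighted / np.maximum(sigma, 1e-8)[..., None]   # avoid divide-by-zero
    return sigma, feat

def render_ray(sigma, feat, deltas):
    """Alpha-composite S samples along one ray into one feature vector.

    deltas: (S,) distances between neighboring samples along the ray.
    """
    alpha = 1.0 - np.exp(-sigma * deltas)                  # per-sample opacity
    # Transmittance: fraction of light surviving up to each sample.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha[:-1])))
    weights = trans * alpha
    return (weights[:, None] * feat).sum(axis=0), weights
```

Running `render_ray` once per pixel of a low-resolution grid would give the feature image that the 2D neural renderer then upsamples to RGB.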

Abstract

Deep generative models allow for photorealistic image synthesis at high resolutions. But for many applications, this is not enough: content creation also needs to be controllable. While several recent works investigate how to disentangle underlying factors of variation in the data, most of them operate in 2D and hence ignore that our world is three-dimensional. Further, only few works consider the compositional nature of scenes. Our key hypothesis is that incorporating a compositional 3D scene representation into the generative model leads to more controllable image synthesis. Representing scenes as compositional generative neural feature fields allows us to disentangle one or multiple objects from the background as well as individual objects' shapes and appearances while learning from unstructured and unposed image collections without any additional supervision. Combining this scene representation with a neural rendering pipeline yields a fast and realistic image synthesis model. As evidenced by our experiments, our model is able to disentangle individual objects and allows for translating and rotating them in the scene as well as changing the camera pose.

Paper Structure

This paper contains 14 sections, 13 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Overview. We represent scenes as compositional generative neural feature fields. For a randomly sampled camera, we volume render a feature image of the scene based on individual feature fields. A 2D neural rendering network converts the feature image into an RGB image. While training only on raw image collections, at test time we are able to control the image formation process with respect to camera pose, object poses, as well as the objects' shapes and appearances. Further, our model generalizes beyond the training data, e.g. we can synthesize scenes with more objects than were present in the training images. Note that for clarity we visualize volumes in color instead of features.
  • Figure 2: Controllable Image Generation. While most generative models operate in 2D, we incorporate a compositional 3D scene representation into the generative model. This leads to more consistent image synthesis results, e.g. note how, in contrast to our method, translating one object might change the other when operating in 2D (Fig. 2a and 2b). It further allows us to perform complex operations like circular translations (Fig. 2c) or adding more objects at test time (Fig. 2d). Both methods are trained unsupervised on raw unposed image collections of two-object scenes.
  • Figure 3: GIRAFFE. Our generator $G_\theta$ takes a camera pose $\boldsymbol{\xi}$ and $N$ shape and appearance codes $\mathbf{z}_s^i, \mathbf{z}_a^i$ and affine transformations $\mathbf{T}_i$ as input and synthesizes an image of the generated scene which consists of $N-1$ objects and a background. The discriminator $D_\phi$ takes the generated image $\hat{\mathbf{I}}$ and the real image $\mathbf{I}$ as input and our full model is trained with an adversarial loss. At test time, we can control the camera pose, the shape and appearance codes of the objects, and the objects' poses in the scene. Orange indicates learnable and blue non-learnable operations.
  • Figure 4: Neural Rendering Operator. The feature image $\mathbf{I}_V$ is processed by $n$ blocks of nearest neighbor upsampling and $3 \times 3$ convolutions with leaky ReLU activations. At every resolution, we map the feature image to an RGB image with a $3 \times 3$ convolution and add it to the previous output via bilinear upsampling. We apply a sigmoid activation to obtain the final image $\hat{\mathbf{I}}$. Gray color indicates outputs, orange learnable, and blue non-learnable operations.
  • Figure 5: Scene Disentanglement. From top to bottom, we show only backgrounds, only objects, color-coded object alpha maps, and the final synthesized images at $64^2$ pixel resolution. Disentanglement emerges without supervision, and the model learns to generate plausible backgrounds although the training data only contains images with objects.
  • ...and 7 more figures
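
The neural rendering operator described in the Figure 4 caption can be sketched in NumPy as follows. This is an illustrative reading of the caption, not the authors' code: the function names, the channel counts, the leaky ReLU slope of 0.2, the half-pixel bilinear interpolation, and the initial RGB projection at the input resolution are all assumptions.

```python
import numpy as np

def conv3x3(x, w):
    """3x3 'same' convolution. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            patch = xp[:, dy:dy + h, dx:dx + wd]
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out

def upsample_nearest(x):
    """2x nearest-neighbor upsampling of a (C, H, W) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def upsample_bilinear(x):
    """2x bilinear upsampling of a (C, H, W) array (half-pixel centers)."""
    _, h, w = x.shape
    ys = np.clip((np.arange(2 * h) + 0.5) / 2 - 0.5, 0, h - 1)
    xs = np.clip((np.arange(2 * w) + 0.5) / 2 - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bot = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def leaky_relu(x, slope=0.2):  # slope is an assumed value
    return np.where(x > 0, x, slope * x)

def neural_render(feat, w_rgb0, blocks):
    """Map a feature image to RGB with n upsampling blocks.

    blocks: list of (w_feat, w_rgb) weight pairs, one per block.
    """
    rgb = conv3x3(feat, w_rgb0)  # RGB projection at the input resolution
    for w_feat, w_rgb in blocks:
        feat = leaky_relu(conv3x3(upsample_nearest(feat), w_feat))
        # Skip connection: upsample the previous RGB output and add the new one.
        rgb = upsample_bilinear(rgb) + conv3x3(feat, w_rgb)
    return 1.0 / (1.0 + np.exp(-rgb))  # sigmoid gives the final image
```

With two blocks, a $4 \times 4$ feature image becomes a $16 \times 16$ RGB image; each block doubles the spatial resolution, so $n$ blocks scale it by $2^n$.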