Class-Continuous Conditional Generative Neural Radiance Field

Jiwook Kim, Minhyeok Lee

TL;DR

This work introduces a novel model, called Class-Continuous Conditional Generative NeRF, which can synthesize conditionally manipulated photorealistic 3D-consistent images by projecting conditional features to the generator and the discriminator.

Abstract

3D-aware image synthesis focuses on conserving spatial consistency in addition to generating high-resolution images with fine details. Recently, the Neural Radiance Field (NeRF) has been introduced for synthesizing novel views with low computational cost and superior performance. While several works investigate generative NeRFs and show remarkable achievements, they cannot handle conditional and continuous feature manipulation in the generation procedure. In this work, we introduce a novel model, called Class-Continuous Conditional Generative NeRF ($\text{C}^{3}$G-NeRF), which can synthesize conditionally manipulated photorealistic 3D-consistent images by projecting conditional features to the generator and the discriminator. The proposed $\text{C}^{3}$G-NeRF is evaluated on three image datasets: AFHQ, CelebA, and Cars. Our model shows strong 3D-consistency with fine details and smooth interpolation in conditional feature manipulation. For instance, $\text{C}^{3}$G-NeRF achieves a Fréchet Inception Distance (FID) of 7.64 in 3D-aware face image synthesis at a $128^{2}$ resolution. Additionally, because $\text{C}^{3}$G-NeRF can synthesize class-conditional images, we report the FIDs of the generated 3D-aware images for each class of the datasets.
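
For reference, the FID quoted above is the standard Fréchet distance between Gaussian fits to Inception features of real and generated images. The symbols below are not taken from this page: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ denote the feature mean and covariance of the real and generated image sets, respectively.

$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$

Lower values indicate that the generated feature distribution is closer to the real one.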

Paper Structure (15 sections, 9 equations, 4 figures, 3 tables)

This paper contains 15 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Synthesized images of each class of AFHQ by our model (with a $256^{2}$ resolution). Each row displays a single object with different rotation input vectors. Note that the images of different classes are generated by a single model with different conditional input vectors. Our model can generate various views of different objects while conserving strong 3D-consistency.
  • Figure 2: Overview of the proposed $\textnormal{C}^{3}$G-NeRF. Since our model is inspired by the architecture of GIRAFFE, it generates $N-1$ objects and the background with $N$ decoders and a composition operator. $D_i$ indicates the $i$th decoder and $C(\cdot)$ represents the composition operator. The decoders take the positional encodings of a 3D coordinate, $\gamma(\textbf{x})$, and of a viewing direction, $\gamma(\textbf{d})$, where $\gamma$ indicates the positional encoding functions. In addition, the decoders take conditional vectors $\textbf{c}$, which are encoded by linear layers, shape codes $\textbf{z}_{\textbf{s}}$, and appearance codes $\textbf{z}_{\textbf{a}}$. The outputs of the decoders are composited with the composition operator $C(\cdot)$, and the result is volume-rendered to produce a composited feature vector $\textbf{v}$. The feature vector $\textbf{v}$ then passes through the neural rendering module $\pi_{neural}$. Through this process, the generator $\textit{G}(\theta)$ synthesizes a fake image $\hat{\textbf{I}}$. The discriminator $\textit{D}$ takes a real image $\textbf{I}$ or the fake image $\hat{\textbf{I}}$, projected by the conditional labels $\textbf{c}$.
  • Figure 3: Class-conditional synthetic object rotation generated by $\textnormal{C}^{3}$G-NeRF trained with CelebA and Cars. In (a), each row represents a single object of CelebA with the same latent vectors, and each column indicates a rotation angle. In the left figure of (a), we fixed the input conditions as a bald man, whereas we fixed the conditions as a blonde smiling woman in the right figure of (a). In (b), by controlling the horizontal and depth translation, the disentanglement of the objects and the background is shown in Horizontal translation and Depth translation. After training with unstructured 2D images with a single object, we can generate $N-1$ objects in one scene by replicating $N$ decoders as in Add objects. All images have a resolution of $128^2$. Using $\textnormal{C}^{3}$G-NeRF, 3D-consistent image generation is successful under the given conditions.
  • Figure 4: Interpolation and extrapolation on conditional input values with CelebA and AFHQ. Each row and column represent the same latent vectors (identical object) of AFHQ and the same class-conditional values, respectively. In (a), we present the results for conditional input values ranging from zero to three. Note that, at training time, the features are trained only with the two values of zero and one, which indicate the absence or presence of the corresponding feature. The features in the face images change smoothly with the interpolated and extrapolated input values, which indicates that the continuous conditional learning has progressed adequately. By interpolating the values of each class, features of each category coexist at the intermediate state of the class-conditional values.
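
The decoder inputs described in Figure 2 and the conditional sweep described in Figure 4 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the frequency count, the feature sizes, the random weights, and the concatenation of the encoded condition with $\gamma(\textbf{x})$ are all assumptions made for illustration; the page only states that the conditional vectors are encoded by linear layers and that conditional values are swept from zero to three.

```python
import numpy as np

def positional_encoding(x, num_freqs):
    """NeRF-style encoding: sin and cos of 2^k * pi * x for k = 0..num_freqs-1."""
    x = np.asarray(x, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi        # (L,)
    angles = x[..., None] * freqs                         # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (..., D, 2L)
    return enc.reshape(*x.shape[:-1], -1)                 # (..., D * 2L)

def encode_condition(c, weight, bias):
    """Single linear layer mapping conditional vectors to features (hypothetical sizes)."""
    return np.asarray(c, dtype=np.float64) @ weight + bias

def sweep_condition(base_c, index, values):
    """Copies of base_c with one attribute set to each value in `values`."""
    out = np.tile(np.asarray(base_c, dtype=np.float64), (len(values), 1))
    out[:, index] = values
    return out

rng = np.random.default_rng(0)

# Four 3D sample points, encoded with 10 frequencies (assumed value).
x = rng.uniform(-1.0, 1.0, size=(4, 3))
gamma_x = positional_encoding(x, num_freqs=10)            # shape (4, 60)

# Five binary attributes (assumed); sweep attribute 2 from 0 to 3.
# Values between 0 and 1 interpolate, values above 1 extrapolate.
weight = rng.normal(size=(5, 16))                          # untrained stand-in weights
bias = np.zeros(16)
conds = sweep_condition(np.zeros(5), index=2, values=np.linspace(0.0, 3.0, 7))
cond_feats = encode_condition(conds, weight, bias)         # shape (7, 16)

# Assumed fusion: concatenate the encoded condition with gamma(x) for one point.
decoder_input = np.concatenate(
    [np.repeat(gamma_x[:1], len(conds), axis=0), cond_feats], axis=-1
)                                                          # shape (7, 76)
```

Because the encoding of $\textbf{c}$ in this sketch is linear, the encoded features vary linearly along the sweep, so values outside the training set $\{0, 1\}$ still give well-defined decoder inputs. Whether the rendered images change smoothly depends on the trained decoders, which this sketch does not model.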