CtrlNeRF: The Generative Neural Radiation Fields for the Controllable Synthesis of High-fidelity 3D-Aware Images

Jian Liu, Zhen Yu

TL;DR

CtrlNeRF presents a single shared-weight MLP generative radiance field that can represent multiple scenes and enables explicit control over 3D geometry and appearance via label-conditioned shape and color codes. By introducing a conditional radiance field and a VGG-based auxiliary discriminator, the method achieves disentangled, controllable 3D-aware image synthesis and novel-view generation from unposed data. Empirical results on the CARs, Synthetic, and LLFF datasets show memory efficiency and competitive PSNR/SSIM, with trade-offs as the number of scenes grows, and demonstrate camera-pose-based view synthesis and feature interpolation. Overall, CtrlNeRF offers a scalable alternative to per-scene NeRF models, enabling multi-scene 3D-aware generation with explicit, label-driven control, while leaving room for further improvements to close the remaining quality gap with text-prompt-driven approaches.
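The results above are reported partly in PSNR. For reference, here is a minimal sketch of the standard definition, $\mathrm{PSNR} = 10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$, computed on flat lists of pixel values that are assumed to lie in $[0, 1]$. It is not the paper's evaluation code, and it does not cover SSIM.

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_val ** 2 / mse)

if __name__ == "__main__":
    print(round(psnr([0.0, 0.5, 1.0], [0.1, 0.5, 0.9]), 2))  # about 21.76 dB
```

Higher is better. Because the MSE sits inside a logarithm, halving it adds only about 3 dB.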

Abstract

The neural radiance field (NeRF) learns a continuous representation of 3D geometry through a multilayer perceptron (MLP). By integrating this representation into a generative model, the generative radiance field (GRAF) can produce images from random noise z without 3D supervision. In practice, shape and appearance are modeled by separate codes z_s and z_a so that they can be manipulated independently during inference. However, it is challenging to represent multiple scenes with a single MLP and to precisely control the generated 3D geometry in terms of shape and appearance. In this paper, we introduce a controllable generative model, CtrlNeRF, that uses a single MLP network with shared weights to represent multiple scenes. We then manipulate the shape and appearance codes to achieve controllable generation of high-fidelity images with 3D consistency. Moreover, the model can synthesize novel views that do not exist in the training set via camera pose alteration and feature interpolation. Extensive experiments were conducted to demonstrate its superiority in 3D-aware image generation over its counterparts.
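To make "manipulating the shape and appearance codes" concrete, here is a minimal sketch. It draws z_s and z_a from a standard normal prior and scales each one by a label code, in the spirit of the multiplicative label-code scheme listed for Figure 3. Several details are hypothetical assumptions for illustration, not the authors' implementation:

- the per-label lookup tables and their random initialization,
- the element-wise form of the product,
- all function names.

```python
import random

def sample_latent(dim, rng):
    # Assumption: GRAF-style latent codes drawn from a standard normal prior.
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def make_label_table(num_labels, dim, rng):
    # Hypothetical: one code vector per label. In training these would be
    # learned parameters; here they are random placeholders.
    return [[rng.uniform(0.5, 1.5) for _ in range(dim)] for _ in range(num_labels)]

def condition(z, label_code):
    # Assumed element-wise multiplication of a latent code by a label code.
    if len(z) != len(label_code):
        raise ValueError("latent and label codes must have the same length")
    return [zi * li for zi, li in zip(z, label_code)]

def conditioned_codes(z_s, z_a, shape_label, color_label, shape_table, color_table):
    # Returns (z'_s, z'_a): the shape code is scaled by the shape-label code and
    # the appearance code by the color-label code, so each can be changed alone.
    return (condition(z_s, shape_table[shape_label]),
            condition(z_a, color_table[color_label]))

if __name__ == "__main__":
    rng = random.Random(0)
    dim = 8
    shape_table = make_label_table(num_labels=3, dim=dim, rng=rng)
    color_table = make_label_table(num_labels=4, dim=dim, rng=rng)
    z_s, z_a = sample_latent(dim, rng), sample_latent(dim, rng)
    # Same noise and shape label, different color label: only z'_a changes.
    s1, a1 = conditioned_codes(z_s, z_a, 0, 1, shape_table, color_table)
    s2, a2 = conditioned_codes(z_s, z_a, 0, 2, shape_table, color_table)
    print(s1 == s2, a1 == a2)
```

Keeping the two codes in separate tables is what lets a label change affect shape or color independently. In the real model, the conditioned codes would then be fed to the shared MLP.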

Paper Structure

This paper contains 14 sections, 7 equations, 16 figures, 3 tables, 1 algorithm.

Figures (16)

  • Figure 1: The architecture of generative adversarial networks (GANs). G refers to a generator, and D refers to a binary discriminator.
  • Figure 2: The framework of the generative neural radiation field (CtrlNeRF), which includes three main components: embedding, generator, and discriminator.
  • Figure 3: Scheme for incorporating label codes into a latent code through multiplication.
  • Figure 4: The architecture of the conditional radiance field (MLP), which takes $z'_{s}$ and $z'_{a}$, as well as $\gamma(x)$ and $\gamma(d)$, as inputs. The model outputs a volume density array $\sigma$ and a color array $c$.
  • Figure 5: Samples of the synthesized images (400x400) on CARs(I) dataset.
  • ...and 11 more figures
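The Figure 4 caption describes an MLP that consumes encoded positions $\gamma(x)$ and view directions $\gamma(d)$ and outputs per-sample densities $\sigma$ and colors $c$. Below is a minimal sketch of the two standard NeRF steps around such an MLP:

- the sinusoidal positional encoding,
- the alpha compositing that turns $\sigma$ and $c$ along a ray into one pixel color.

It assumes CtrlNeRF follows the original NeRF formulation that GRAF builds on. The MLP itself is omitted. The frequency count, the sample spacing, and the function names are illustrative choices.

```python
import math

def positional_encoding(p, num_freqs):
    """NeRF-style encoding: sin/cos of each coordinate at frequencies 2^k * pi.

    Returns a list of length 2 * num_freqs * len(p). Some implementations
    also append the raw coordinates; this sketch does not.
    """
    out = []
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        for x in p:
            out.append(math.sin(freq * x))
            out.append(math.cos(freq * x))
    return out

def composite(sigmas, colors, deltas):
    """Alpha-composite per-sample densities and RGB colors along one ray.

    C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
    with transmittance T_i = exp(-sum_{j<i} sigma_j * delta_j).
    """
    if not (len(sigmas) == len(colors) == len(deltas)):
        raise ValueError("sigmas, colors and deltas must have equal length")
    transmittance = 1.0
    rgb = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # its contribution to the pixel
        for ch in range(3):
            rgb[ch] += weight * color[ch]
        transmittance *= 1.0 - alpha            # light left for later samples
    return rgb

if __name__ == "__main__":
    enc = positional_encoding([0.1, -0.2, 0.3], num_freqs=4)
    print(len(enc))  # 24
    # A fairly transparent red sample in front of a dense green one.
    print(composite([0.5, 50.0], [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.1, 0.1]))
```

The compositing weights sum to at most 1. A ray through empty space (all $\sigma = 0$) therefore returns black unless a background color is added.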