CtrlNeRF: The Generative Neural Radiation Fields for the Controllable Synthesis of High-fidelity 3D-Aware Images
Jian Liu, Zhen Yu
TL;DR
CtrlNeRF is a generative radiance field that uses a single shared-weight MLP to represent multiple scenes and provides explicit control over 3D geometry and appearance through label-conditioned shape and color codes. By introducing a conditional radiance field and a VGG-based auxiliary discriminator, the method achieves disentangled, controllable 3D-aware image synthesis and novel-view generation from unposed data. Empirical results on the CARs, Synthetic, and LLFF datasets show memory efficiency and competitive PSNR/SSIM, with quality trading off against the number of scenes, and demonstrate view synthesis via camera-pose changes and feature interpolation. Overall, CtrlNeRF offers a scalable alternative to per-scene NeRF models, enabling multi-scene 3D-aware generation with explicit, label-driven control, while leaving room for further work to close the remaining quality gap with text-prompt-driven approaches.
Abstract
The neural radiance field (NeRF) learns a continuous representation of 3D geometry through a multilayer perceptron (MLP). By integrating this representation into a generative model, the generative neural radiance field (GRAF) can produce images from random noise z without 3D supervision. In practice, shape and appearance are modeled by separate codes z_s and z_a, respectively, so that they can be manipulated independently during inference. However, it is challenging to represent multiple scenes with a single MLP and to precisely control the generated 3D geometry in terms of shape and appearance. In this paper, we introduce a controllable generative model, CtrlNeRF, that uses a single MLP with shared weights to represent multiple scenes. By manipulating the shape and appearance codes, we realize controllable generation of high-fidelity images with 3D consistency. Moreover, the model enables the synthesis of novel views that do not exist in the training sets via camera pose alteration and feature interpolation. Extensive experiments demonstrate its superiority in 3D-aware image generation compared to its counterparts.
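To make the idea of a single shared-weight MLP conditioned on shape and appearance codes more concrete, the following is a minimal NumPy sketch and not the authors' implementation. The class name `ConditionalRadianceField`, the layer sizes, the one-hot label concatenation, and the helper `interpolate_codes` are all assumptions made for illustration; the abstract does not specify how labels are injected. The weights are random (untrained), and volume rendering and the discriminators are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible


def positional_encoding(x, num_freqs):
    """Map each coordinate to [sin(2^k x), cos(2^k x)] for k = 0..num_freqs-1."""
    freqs = 2.0 ** np.arange(num_freqs)                # (F,)
    angles = x[..., None] * freqs                      # (N, D, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(x.shape[0], -1)                 # (N, D * 2F)


def init_linear(n_in, n_out):
    """Random (untrained) weights and zero bias for one dense layer."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)


def dense(v, layer):
    w, b = layer
    return v @ w + b


class ConditionalRadianceField:
    """Hypothetical shared-weight MLP: one set of weights for all scenes."""

    def __init__(self, num_scenes, code_dim=8, hidden=64, pos_freqs=6, dir_freqs=4):
        self.num_scenes = num_scenes
        self.pos_freqs, self.dir_freqs = pos_freqs, dir_freqs
        cond_dim = code_dim + num_scenes  # latent code + one-hot label (assumed)
        self.shape_layer = init_linear(3 * 2 * pos_freqs + cond_dim, hidden)
        self.sigma_head = init_linear(hidden, 1)
        self.app_layer = init_linear(hidden + 3 * 2 * dir_freqs + cond_dim, hidden)
        self.rgb_head = init_linear(hidden, 3)

    def __call__(self, x, d, z_s, z_a, label):
        """x: (N, 3) points, d: (N, 3) view directions, label: scene index."""
        n = x.shape[0]
        one_hot = np.eye(self.num_scenes)[label]
        cond_s = np.tile(np.concatenate([z_s, one_hot]), (n, 1))
        cond_a = np.tile(np.concatenate([z_a, one_hot]), (n, 1))

        # Shape branch: density depends on position, z_s, and the label only.
        shape_in = np.hstack([positional_encoding(x, self.pos_freqs), cond_s])
        h = np.maximum(dense(shape_in, self.shape_layer), 0.0)
        sigma = np.maximum(dense(h, self.sigma_head), 0.0)[:, 0]

        # Appearance branch: color also sees view direction and z_a.
        app_in = np.hstack([h, positional_encoding(d, self.dir_freqs), cond_a])
        h_a = np.maximum(dense(app_in, self.app_layer), 0.0)
        rgb = 1.0 / (1.0 + np.exp(-dense(h_a, self.rgb_head)))
        return rgb, sigma


def interpolate_codes(z_start, z_end, t):
    """Linear interpolation between two latent codes, t in [0, 1]."""
    return (1.0 - t) * z_start + t * z_end
```

In this sketch, changing `z_a` while holding `z_s` and `label` fixed leaves the density unchanged by construction, which is one simple way to obtain the separate shape and appearance control described above. Passing `interpolate_codes(z_a0, z_a1, t)` as the appearance code illustrates feature interpolation between two samples.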
