DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields
Yu Chi, Fangneng Zhan, Sibo Wu, Christian Theobalt, Adam Kortylewski
TL;DR
DatasetNeRF addresses the data-hungry nature of 3D vision by generating large-scale, 3D-consistent annotations from a small set of 2D labels. It builds a semantic segmentation branch on top of a pretrained 3D GAN backbone (EG3D), using an augmented semantic tri-plane, depth and density priors, and volumetric rendering to produce multi-view 2D masks and back-projected 3D point clouds. The approach supports both articulated and non-articulated radiance fields and enables 3D-aware editing and inversion, with demonstrated improvements in 3D consistency and segmentation accuracy over baselines on the AFHQ-Cat, FFHQ, AIST++, NeRSemble, and ShapeNet-Car datasets. By generating 3D-aware data efficiently, DatasetNeRF offers a practical path to data augmentation and downstream 3D tasks with limited human labeling.
Abstract
Progress in 3D computer vision tasks demands a huge amount of data, yet annotating multi-view images with 3D-consistent annotations, or point clouds with part segmentation, is both time-consuming and challenging. This paper introduces DatasetNeRF, a novel approach capable of generating infinite, high-quality 3D-consistent 2D annotations alongside 3D point cloud segmentations, while using minimal 2D human-labeled annotations. Specifically, we leverage the strong semantic prior within a 3D generative model to train a semantic decoder, requiring only a handful of fine-grained labeled samples. Once trained, the decoder efficiently generalizes across the latent space, enabling the generation of infinite data. The generated data is applicable across various computer vision tasks, including video segmentation and 3D point cloud segmentation. Our approach not only surpasses baseline models in segmentation quality, achieving superior 3D consistency and segmentation precision on individual images, but also demonstrates versatility by being applicable to both articulated and non-articulated generative models. Furthermore, we explore applications stemming from our approach, such as 3D-aware semantic editing and 3D inversion.
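
To make the rendering step named in the TL;DR concrete, the following is a minimal sketch of how per-sample semantic logits and depth can be composited along a single camera ray, and how the rendered depth can be back-projected to a labeled 3D point. It uses the standard NeRF-style quadrature rather than the authors' implementation; the function names `render_ray` and `back_project`, the array shapes, and the toy inputs are all assumptions made for illustration.

```python
import numpy as np


def render_ray(densities, deltas, sem_logits, t_vals):
    """Composite semantics and depth along one ray (illustrative sketch).

    densities:  (S,)   non-negative volume density at each sample
    deltas:     (S,)   distance between consecutive samples
    sem_logits: (S, C) per-sample semantic logits for C classes
    t_vals:     (S,)   distance of each sample from the ray origin
    """
    # Opacity contributed by each sample.
    alpha = 1.0 - np.exp(-densities * deltas)
    # Transmittance: fraction of light reaching sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    sem = weights @ sem_logits   # (C,) rendered semantic logits
    depth = weights @ t_vals     # expected ray termination distance
    return sem, depth


def back_project(origin, direction, depth):
    """Lift a rendered depth to a 3D point; `direction` is assumed unit-length."""
    return origin + depth * direction


# Toy ray: empty space, then an opaque surface at t = 1.5.
densities = np.array([0.0, 0.0, 1e6, 5.0])
deltas = np.full(4, 0.1)
t_vals = np.array([0.5, 1.0, 1.5, 2.0])
sem_logits = np.array([
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.1, 2.0, -1.0],
    [3.0, 0.0, 0.0],
])

sem, depth = render_ray(densities, deltas, sem_logits, t_vals)
label = int(np.argmax(sem))
point = back_project(np.zeros(3), np.array([0.0, 0.0, 1.0]), depth)
```

Repeating this for every pixel of a camera view would yield a 2D mask, and collecting the back-projected points together with their labels would yield a segmented point cloud. Because all views query the same underlying field, the labels agree across viewpoints by construction in this simplified setting.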
