DatasetNeRF: Efficient 3D-aware Data Factory with Generative Radiance Fields

Yu Chi, Fangneng Zhan, Sibo Wu, Christian Theobalt, Adam Kortylewski

TL;DR

DatasetNeRF addresses the data-hungry nature of 3D vision by generating large-scale, 3D-consistent annotations from a small set of 2D labels. It builds a semantic segmentation branch on top of a pretrained 3D GAN backbone (EG3D), using an augmented semantic tri-plane, depth and density priors, and volumetric rendering to produce multi-view 2D masks and back-projected 3D point clouds. The approach supports both articulated and non-articulated radiance fields and enables 3D-aware editing and inversion, with demonstrated improvements in 3D consistency and segmentation accuracy over baselines on the AFHQ-Cat, FFHQ, AIST++, NeRSemble, and ShapeNet-Car datasets. By generating 3D-aware data efficiently, DatasetNeRF offers a practical path to data augmentation and downstream 3D tasks with limited human labeling, potentially accelerating 3D vision development and deployment.

Abstract

Progress in 3D computer vision tasks demands a huge amount of data, yet annotating multi-view images with 3D-consistent annotations, or point clouds with part segmentation, is both time-consuming and challenging. This paper introduces DatasetNeRF, a novel approach capable of generating infinite, high-quality 3D-consistent 2D annotations alongside 3D point cloud segmentations, while utilizing minimal 2D human-labeled annotations. Specifically, we leverage the strong semantic prior within a 3D generative model to train a semantic decoder, requiring only a handful of fine-grained labeled samples. Once trained, the decoder efficiently generalizes across the latent space, enabling the generation of infinite data. The generated data is applicable across various computer vision tasks, including video segmentation and 3D point cloud segmentation. Our approach not only surpasses baseline models in segmentation quality, achieving superior 3D consistency and segmentation precision on individual images, but also demonstrates versatility by being applicable to both articulated and non-articulated generative models. Furthermore, we explore applications stemming from our approach, such as 3D-aware semantic editing and 3D inversion.

Paper Structure (22 sections, 3 equations, 13 figures, 5 tables)

Figures (13)

  • Figure 1: DatasetNeRF Pipeline Overview: (1) The manual creation of a small set of multi-view consistent annotations, followed by the training of a semantic segmentation branch using a pretrained 3D GAN backbone. (2) Leveraging the latent space's generalizability to produce an infinite array of 3D-consistent, fine-grained annotations. (3) Employing a depth prior from the 3D GAN backbone to back-project 2D segmentations to 3D point cloud segmentations.
  • Figure 2: Overall Architecture of DatasetNeRF. The DatasetNeRF architecture unifies a pretrained EG3D model with a semantic segmentation branch, comprising an enhanced semantic tri-plane, a semantic feature decoder, and a semantic super-resolution module. The semantic feature tri-plane is constructed by reshaping the concatenated outputs from all synthesis blocks of the EG3D generator. The semantic feature decoder interprets aggregated features from the semantic tri-plane into a 32-channel semantic feature for every point. The semantic feature map is rendered by semantic volumetric rendering. We incorporate a density prior from the pretrained RGB decoder during the rendering process to enhance 3D consistency. The semantic super-resolution module upscales and refines the rendered semantic feature map into the final semantic output. Combining the semantic mask output with the upsampled depth map from the pretrained EG3D model enables efficient back-projection of the semantic mask, which in turn yields point cloud part segmentation.
  • Figure 3: The illustration of multi-view point cloud fusion.
  • Figure 4: Examples of synthesized image-annotation pairs from the 3D-aware data factory.
  • Figure 5: Visualization of real-world human face point cloud segmentations.
  • ...and 8 more figures
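
The captions for Figures 1 and 2 describe two steps that are easy to picture in code: compositing per-point semantic features along each ray with weights derived from the RGB branch's density, and lifting a 2D semantic mask to a labeled point cloud using a depth map. The NumPy sketch below illustrates both under stated assumptions and is not the authors' implementation. The function names (`composite_semantics`, `backproject_mask`), the pinhole intrinsics `K`, the `cam2world` matrix, and the treatment of depth as z-depth (rather than distance along the ray) are all hypothetical choices made for illustration.

```python
import numpy as np

def composite_semantics(sigmas, feats, deltas):
    """Alpha-composite per-sample features along each ray.

    sigmas: (R, S) densities, assumed to come from a frozen RGB branch.
    feats:  (R, S, C) per-sample semantic features.
    deltas: (R, S) distances between consecutive samples.
    Returns the (R, C) rendered features and the (R, S) weights.
    """
    alpha = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: probability that the ray reaches each sample unoccluded.
    ones = np.ones_like(alpha[:, :1])
    trans = np.cumprod(np.concatenate([ones, 1.0 - alpha[:, :-1]], axis=1), axis=1)
    weights = alpha * trans
    return (weights[..., None] * feats).sum(axis=1), weights

def backproject_mask(depth, mask, K, cam2world):
    """Lift an (H, W) depth map and label mask to a labeled point cloud.

    Assumes a pinhole camera with 3x3 intrinsics K, a 4x4 camera-to-world
    matrix, and depth measured along the camera z-axis.
    Returns (H*W, 3) world-space points and (H*W,) labels in row-major order.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # u: column, v: row
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                 # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    pts_world = (pts_h @ cam2world.T)[:, :3]
    return pts_world, mask.reshape(-1)
```

Because the weights depend only on density, reusing the same densities for color and semantics ties both renderings to the same geometry, which is one plausible reading of how a density prior helps 3D consistency. Point clouds back-projected from several views could then be concatenated in world coordinates, in the spirit of the multi-view fusion shown in Figure 3.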