GO-NeRF: Generating Objects in Neural Radiance Fields for Virtual Reality Content Creation

Peng Dai, Feitong Tan, Xin Yu, Yifan Peng, Yinda Zhang, Xiaojuan Qi

TL;DR

A novel pipeline featuring an intuitive interface and a compositional rendering formulation that effectively integrates the generated 3D objects into the scene, utilizing optimized 3D-aware opacity maps to avoid unintended modifications to the original scene.

Abstract

Virtual environments (VEs) are pivotal for virtual, augmented, and mixed reality systems. Despite advances in 3D generation and reconstruction, the direct creation of 3D objects within an established 3D scene (represented as NeRF) for novel VE creation remains a relatively unexplored domain. This process is complex, requiring not only the generation of high-quality 3D objects but also their seamless integration into the existing scene. To this end, we propose a novel pipeline featuring an intuitive interface, dubbed GO-NeRF. Our approach takes text prompts and user-specified regions as inputs and leverages the scene context to generate 3D objects within the scene. We employ a compositional rendering formulation that effectively integrates the generated 3D objects into the scene, utilizing optimized 3D-aware opacity maps to avoid unintended modifications to the original scene. Furthermore, we develop tailored optimization objectives and training strategies to enhance the model's ability to capture scene context and mitigate artifacts, such as floaters, that may occur while optimizing 3D objects within the scene. Extensive experiments conducted on both forward-facing and 360° scenes demonstrate the superior performance of our proposed method in generating objects that harmonize with surrounding scenes and synthesizing high-quality novel view images. We are committed to making our code publicly available.
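As a rough illustration of the compositional rendering idea in the abstract, the sketch below blends a rendered scene image with a generated-object image using a per-pixel opacity map. It assumes plain alpha compositing, $I = O \cdot G + (1 - O) \cdot S$, which is a common formulation and not necessarily the exact equation used in the paper; the function name `composite` and the nested-list image layout are hypothetical choices made for this example.

```python
from typing import List, Tuple

Pixel = Tuple[float, float, float]
Image = List[List[Pixel]]
OpacityMap = List[List[float]]


def composite(scene: Image, generated: Image, opacity: OpacityMap) -> Image:
    """Blend per pixel with assumed alpha compositing: I = O * G + (1 - O) * S.

    Where the opacity is 0 the original scene pixel passes through unchanged,
    which is how an opacity map can protect the scene from unintended edits.
    """
    blended: Image = []
    for scene_row, gen_row, opacity_row in zip(scene, generated, opacity):
        row: List[Pixel] = []
        for s, g, o in zip(scene_row, gen_row, opacity_row):
            r, gr, b = (o * gc + (1.0 - o) * sc for gc, sc in zip(g, s))
            row.append((r, gr, b))
        blended.append(row)
    return blended


# Tiny 1x2 example: the first pixel is half covered by the object,
# the second pixel lies outside the object and keeps the scene colour.
scene = [[(0.0, 0.0, 0.0), (0.2, 0.4, 0.6)]]
generated = [[(1.0, 1.0, 1.0), (1.0, 0.0, 0.0)]]
opacity = [[0.5, 0.0]]
image = composite(scene, generated, opacity)
```

In this toy case the second output pixel equals the scene pixel exactly, because its opacity is zero.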

Paper Structure (13 sections, 9 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Virtual object generation pipeline. Left: we offer a user-friendly interface for specifying generation regions in the pre-trained 3D scene. Specifically, users can define a 3D bounding box by selecting three points on the image; the box is derived from these points using perspective projection and cross-product operations. Right: our approach separates the scene rendering (top) and object generation (bottom) processes, whose outputs are subsequently combined in the rendered image space. The scene rendering phase generates RGB-D images $(S, D)$ of the 3D scene using pre-defined cameras $C$. The object generation step optimizes a neural radiance field within the 3D box to produce RGB images $G_v$ (Eq. \ref{equ:volume_rendering}) and opacity maps $O_v$ (Eq. \ref{equ:opacity}) through volume rendering. The final output $I_v$ (Eq. \ref{equ:deferred_rendering}) is then created by blending the scene and the generated content using the optimized opacity maps. Throughout the optimization, we design loss functions and training strategies to ensure high-quality composited results.
  • Figure 2: Qualitative comparison. We compare our method with other baselines on forward-facing and $360^\circ$ scenes. The first row displays the 3D box alongside its corresponding 2D mask in image space, while the subsequent rows present the results of various methods. Blended-NeRF tends to produce unrealistic and disharmonious results, such as fruits floating in the air. Spin-NeRF$^*$ fails in stylized scenes and in $360^\circ$ scenes with large view changes. Moreover, manual placement is tedious and ignores the influence of scene context. In contrast, our method excels across all scenes, producing cats with different appearances and fruits on the table, accompanied by plausible shadows that enhance overall composition quality. At the bottom right of the last row, we visualize the optimized opacity maps that precisely describe the silhouette of generated content.
  • Figure 3: Ablation studies. (a) Our proposed inpainting SDS loss effectively utilizes the scene context to generate a cat with an accurate shape and pose, whereas the standard SDS loss only produces a cat's head. (b) We present rendered RGB images alongside their corresponding saturation maps at the bottom left, where bright regions indicate high saturation values. Without constraining the saturation values, the generated backpack appears over-saturated. (c) The generated content is marred by artifacts that closely resemble the scene background, making their removal challenging. The opacity maps in the second row provide a clear visualization of this issue. The best results are achieved when employing both sparsity loss and background augmentation. (d) The coarse-to-fine optimization strategy improves the generation of compact and view-consistent objects, as exemplified by the chair's shape when viewed from different camera perspectives.
  • Figure 4: Stereoscopic results for VR & 3D reconstruction. (a) We render the new scene with generated 3D objects into stereoscopic results and visualize their stereo effects by predicting the disparity map from rendered left- and right-view images using a stereo transformer \cite{li2021revisiting}. The resulting disparity maps exhibit sharp details with clear foreground and background distinctions, indicating high-quality stereo effects. (b) We render the scene with generated virtual objects from different camera perspectives. Following the recent work \cite{tancik2023nerfstudio}, we use those rendered multi-view images to refine NeRF optimization and extract the underlying 3D mesh. The successful 3D mesh reconstruction showcases consistency across diverse camera views.
  • Figure 5: Application scenarios supported by our GO-NeRF framework. (a) Our method successfully generates suspended objects, exemplified by a bird flying in the air, captured from various camera perspectives. (b) Our method can generate multiple objects within an established 3D scene. (c) Style adaptation is supported: given a reference image as a guide, the generated object mirrors the visual characteristics of that image. (d) Editing is supported: by adjusting the input text prompt, we can customize the appearance of generated objects, such as altering the bird's color. Furthermore, the decomposed representation of the scene and objects allows generated elements to be rearranged easily.
  • ...and 2 more figures
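
The caption of Figure 1 mentions that a 3D bounding box is defined from three selected image points via perspective projection and cross-product operations. The following is a minimal sketch of one way such a construction could work, not the authors' implementation. It assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), a depth value for each clicked pixel, and a user-chosen box `height`; all function and parameter names are hypothetical.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def unproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift a pixel (u, v) with known depth to a 3D camera-space point,
    inverting the pinhole perspective projection."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def add(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def box_from_three_points(p0: Vec3, p1: Vec3, p2: Vec3,
                          height: float) -> List[Vec3]:
    """Return 8 box corners: a base spanned by the edges p0->p1 and p0->p2,
    extruded along the base normal obtained from their cross product.
    The normal's direction depends on the order of the selected points."""
    edge_a = sub(p1, p0)
    edge_b = sub(p2, p0)
    normal = cross(edge_a, edge_b)
    length = math.sqrt(sum(c * c for c in normal))
    if length == 0.0:
        raise ValueError("the three points are collinear")
    up = (normal[0] / length * height,
          normal[1] / length * height,
          normal[2] / length * height)
    base = [p0, p1, add(p1, edge_b), p2]
    top = [add(corner, up) for corner in base]
    return base + top


# Three clicked pixels, all at depth 2.0, with assumed intrinsics.
intrinsics = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p0 = unproject(320.0, 240.0, 2.0, **intrinsics)
p1 = unproject(570.0, 240.0, 2.0, **intrinsics)
p2 = unproject(320.0, 490.0, 2.0, **intrinsics)
corners = box_from_three_points(p0, p1, p2, height=0.5)
```

With this choice the base is a parallelogram, so the box is rectangular only when the two selected edges are perpendicular; an interface could orthogonalize the second edge first if a strict cuboid is required.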