3DDesigner: Towards Photorealistic 3D Object Generation and Editing with Text-guided Diffusion Models

Gang Li, Heliang Zheng, Chaoyue Wang, Chang Li, Changwen Zheng, Dacheng Tao

TL;DR

The paper tackles extending text-guided diffusion models to 3D by introducing a NeRF-based conditioning module and a two-stream asynchronous diffusion backbone to achieve 3D-consistent generation. It also presents a novel 3D local editing pipeline based on noise blending and noise-to-text inversion, plus a one-shot novel view synthesis approach via single-image fine-tuning. The method, validated on RealCars and 3D-FUTURE, demonstrates improved 3D consistency, photorealism, and controllability over prior 3D diffusion approaches, and enables 360-degree edits from a single view. This work significantly advances practical, text-driven 3D asset creation and manipulation with potential applications in virtual prototyping and design workflows.

Abstract

Text-guided diffusion models have shown superior performance in image/video generation and editing, while few explorations have been performed in 3D scenarios. In this paper, we discuss three fundamental and interesting problems on this topic. First, we equip text-guided diffusion models to achieve 3D-consistent generation. Specifically, we integrate a NeRF-like neural field to generate low-resolution coarse results for a given camera view. Such results can provide 3D priors as condition information for the following diffusion process. During denoising diffusion, we further enhance the 3D consistency by modeling cross-view correspondences with a novel two-stream (corresponding to two different views) asynchronous diffusion process. Second, we study 3D local editing and propose a two-step solution that can generate 360-degree manipulated results by editing an object from a single view. In Step 1, we perform 2D local editing by blending the predicted noises. In Step 2, we conduct a noise-to-text inversion process that maps the 2D blended noises into the view-independent text embedding space. Once the corresponding text embedding is obtained, 360-degree images can be generated. Last but not least, we extend our model to perform one-shot novel view synthesis by fine-tuning on a single image, showing for the first time the potential of leveraging text guidance for novel view synthesis. Extensive experiments and various applications show the prowess of our 3DDesigner. The project page is available at https://3ddesigner-diffusion.github.io/.
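
To make Step 1 of the local editing pipeline more concrete, here is a minimal sketch of what blending predicted noises could look like. The abstract does not give the blending rule, so the mask-weighted combination below is an assumption, and the names `blend_noise`, `eps_source`, `eps_edit`, and `mask` are hypothetical. It is not the authors' implementation. Noise maps are represented as small nested lists so that the example runs without external libraries.

```python
from typing import List

Grid = List[List[float]]


def blend_noise(eps_source: Grid, eps_edit: Grid, mask: Grid) -> Grid:
    """Blend two predicted-noise maps with a spatial mask.

    Assumed rule (not taken from the paper): where mask is 1 the noise
    predicted for the edit prompt is used, where mask is 0 the noise
    predicted for the source prompt is kept.
    """
    if not (len(eps_source) == len(eps_edit) == len(mask)):
        raise ValueError("all inputs must have the same number of rows")
    blended: Grid = []
    for row_s, row_e, row_m in zip(eps_source, eps_edit, mask):
        if not (len(row_s) == len(row_e) == len(row_m)):
            raise ValueError("all inputs must have the same number of columns")
        blended.append(
            [m * e + (1.0 - m) * s for s, e, m in zip(row_s, row_e, row_m)]
        )
    return blended


# Toy 2x2 example: only the top-left location is edited.
eps_source = [[0.1, 0.2], [0.3, 0.4]]
eps_edit = [[1.0, 1.0], [1.0, 1.0]]
mask = [[1.0, 0.0], [0.0, 0.0]]
blended = blend_noise(eps_source, eps_edit, mask)
```

In a sampler, a function like this would be called once per denoising step, so that the region outside the mask follows the source prompt while the masked region follows the edit prompt.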

Paper Structure

This paper contains 15 sections, 9 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: "3DDesigner" -- Text-guided 3D object generation and editing. Given a text, e.g., "A red sedan BMW 5 series 2021", our method can 1) generate the corresponding 3D images, 2) create "SUV" and "Sports" 3D counterparts, and 3) support text-guided 3D local editing.
  • Figure 2: An illustration of our framework for text-guided 3D-consistent generation (training phase). (A) NeRF-based Condition Module, which takes $<$one coarse text, two camera views$>$ pairs as inputs and generates low-resolution coarse results. The coarse results are resized and concatenated with noised images to provide conditions for denoising. (B) Two-stream Asynchronous Diffusion Module, which takes $<$one full text, two coarse results, two timesteps, two noised images$>$ quadruples as inputs and predicts the added noises. Each stream is a vanilla text-guided diffusion model except for the feature interaction module after each attention block. Note that the timesteps are randomly generated and the parameters of these two streams are shared.
  • Figure 3: An illustration of 3D local editing. We propose to blend noises in each sampling step to achieve 2D local editing and conduct noise-to-text inversion to generate 3D manipulated images. The notations are explained in Sec. \ref{sec:local}.
  • Figure 4: Our 3DDesigner can perform (A) fine-grained 3D generation, (B) semantically meaningful interpolation in the text embedding space, and (C) controllable generation that can change car types to create counterparts of real-world car models.
  • Figure 5: A set of examples consisting of image-text pairs sourced from our collected RealCars dataset.
  • ...and 7 more figures