Sketch2NeRF: Multi-view Sketch-guided Text-to-3D Generation

Minglin Chen; Weihao Yuan; Yukun Wang; Zhe Sheng; Yisheng He; Zilong Dong; Liefeng Bo; Yulan Guo

TL;DR

Sketch2NeRF tackles the problem of fine-grained control in text-to-3D generation by leveraging multi-view sketches as additional constraints. It optimizes a neural radiance field (NeRF) with guidance from pretrained 2D diffusion models (Stable Diffusion and ControlNet) through a synchronized generation and reconstruction framework and an annealed time schedule, without requiring sketch-3D paired data. The method achieves superior sketch fidelity and text alignment compared with state-of-the-art baselines, demonstrated on two new multi-view sketch datasets, and shows robustness to varying numbers of sketches and viewpoints. This approach provides a practical pathway to controllable, high-fidelity 3D content generation guided by human sketches.

Abstract

Recently, text-to-3D approaches have achieved high-fidelity 3D content generation from text descriptions. However, the generated objects are stochastic and lack fine-grained control. Sketches provide a cheap way to introduce such control. Nevertheless, achieving flexible control from sketches is challenging because they are abstract and ambiguous. In this paper, we present a multi-view sketch-guided text-to-3D generation framework, Sketch2NeRF, that adds sketch control to 3D generation. Specifically, our method leverages pretrained 2D diffusion models (e.g., Stable Diffusion and ControlNet) to supervise the optimization of a 3D scene represented by a neural radiance field (NeRF). We propose a novel synchronized generation and reconstruction method to effectively optimize the NeRF. For the experiments, we collected two kinds of multi-view sketch datasets to evaluate the proposed method. We demonstrate that our method can synthesize 3D-consistent content with fine-grained sketch control while remaining faithful to text prompts. Extensive results show that our method achieves state-of-the-art performance in terms of sketch similarity and text alignment.
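The alternation between generation and reconstruction described above can be illustrated with a toy sketch. This is not the authors' implementation: the "scene" is just one image per view, `toy_denoiser` is a hypothetical stand-in for ControlNet and Stable Diffusion, and the linear schedule in `annealed_t` (including the endpoint 0.02) is an assumption, since the exact annealing rule is not stated here.

```python
import numpy as np

def annealed_t(step, total_steps, t_max=0.98, t_min=0.02):
    """Assumed linear annealing of the noise level from t_max down to t_min."""
    frac = step / max(total_steps - 1, 1)
    return t_max - (t_max - t_min) * frac

def toy_denoiser(rendered, prior_image, t):
    """Hypothetical stand-in for a diffusion model.

    At a high noise level the output is dominated by the model's prior;
    at a low noise level it stays close to the current render.
    """
    return (1.0 - t) * rendered + t * prior_image

def optimize(scene, prior_images, sketch_views, total_steps=200, lr=0.5, seed=0):
    """Alternate generation and reconstruction; `scene` is updated in place."""
    rng = np.random.default_rng(seed)
    n_views = scene.shape[0]
    for step in range(total_steps):
        t = annealed_t(step, total_steps)
        # Generation stage: one sketch view (ControlNet's role in the paper)
        # and one randomly sampled view (Stable Diffusion's role).
        views = [rng.choice(sketch_views), rng.integers(n_views)]
        generated = {v: toy_denoiser(scene[v], prior_images[v], t) for v in views}
        # Reconstruction stage: one gradient step on
        # 0.5 * ||render - generated||^2 for each generated view.
        for v, image in generated.items():
            scene[v] -= lr * (scene[v] - image)
    return scene
```

A call such as `optimize(np.zeros((4, 8, 8)), prior_images, sketch_views=[0, 2])` pulls each toy view toward its prior image. In the real method the update would go through NeRF parameters and a differentiable renderer rather than per-view pixel arrays.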

Paper Structure (17 sections, 10 equations, 9 figures, 1 table)

Introduction
Related Work
3D Generation
Controllable Generation
Sketch-based 3D Generation
Methodology
3D Representation
Sketch-conditioned Guidance
Optimization
Implementation
Experiments
Datasets
Evaluation Metrics
Baselines
Results
...and 2 more sections

Figures (9)

Figure 1: Sketch2NeRF is a sketch-guided text-to-3D generative model that produces high-fidelity 3D objects resembling multi-view sketches. Top: our method can use an arbitrary number of sketches (usually more than 3) as input. Middle: generated 3D objects (shown as rendered RGB and normal images) whose shapes are controlled by the input sketches. Bottom: rendered RGB images at novel views. Note that these 3D objects are all generated using the same prompt, “a teapot”.
Figure 2: Sketch2NeRF overview. We represent a 3D object using a neural radiance field (NeRF), which is optimized with the proposed synchronized generation and reconstruction method. At the generation stage, ControlNet is used to generate real images at the specific poses of the sketches, while Stable Diffusion is employed to generate real images at randomly sampled poses. At the reconstruction stage, we update the NeRF parameters so that the reconstruction loss between the generated and rendered images is minimized.
Figure 3: Rendered RGB and opacity images of the generated 3D objects without and with the random viewpoint regularization. The regularization effectively eliminates near-plane artifacts and floaters in the generated 3D objects.
Figure 4: Images generated with different levels of noise. The generated images differ greatly from the original when the added noise is large (e.g., $t=0.98$).
Figure 5: Qualitative comparisons on three different objects against four baseline methods. The results indicate that our method produces more consistent and higher-fidelity 3D objects under multi-view sketch control.
...and 4 more figures
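
The effect noted in the Figure 4 caption, that little of the original image survives at a large noise level, can be illustrated with a small forward-noising sketch. It assumes a variance-preserving mix with $\bar{\alpha} = 1 - t$, which is a simplification for illustration and not the actual Stable Diffusion noise schedule; the random array `x0` is a toy stand-in for an image.

```python
import numpy as np

def add_noise(x0, t, rng):
    """Variance-preserving noising, assuming alpha_bar = 1 - t (illustrative only)."""
    alpha_bar = 1.0 - t
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def correlation(a, b):
    """Pearson correlation between two arrays, flattened."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64))  # toy unit-variance "image"
for t in (0.2, 0.5, 0.98):
    noisy = add_noise(x0, t, rng)
    print(f"t={t:.2f}  correlation with original: {correlation(x0, noisy):.2f}")
```

Under this assumption the signal coefficient at $t=0.98$ is $\sqrt{0.02} \approx 0.14$, so a denoiser starting from that input has little of the original to preserve, which is consistent with the caption's observation.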
