SketchDream: Sketch-based Text-to-3D Generation and Editing

Feng-Lin Liu, Hongbo Fu, Yu-Kun Lai, Lin Gao

TL;DR

SketchDream presents a unified framework for sketch-based text-to-3D generation and editing of NeRF-based content. It conditions a multi-view diffusion model on the input sketch, using depth-guided warping and a 3D attention module to keep the generated views consistent, and optimizes a NeRF with 3D score distillation sampling (SDS) and 2D interval score matching (ISM) losses. A coarse-to-fine editing pipeline enables precise local edits while preserving unedited regions: it lifts 2D masks to 3D masks, then refines them through mesh-based labeling and local enhancement. Experiments show higher generation quality than 2D-then-3D baselines and finer control than existing diffusion-based 3D editing approaches. The approach makes 3D content creation from simple sketches and text more intuitive and controllable, with practical implications for rapid design and customization.

Abstract

Existing text-based 3D generation methods generate attractive results but lack detailed geometry control. Sketches, known for their conciseness and expressiveness, have contributed to intuitive 3D modeling but are confined to producing texture-less mesh models within predefined categories. Integrating sketch and text simultaneously for 3D generation promises enhanced control over geometry and appearance but faces challenges from 2D-to-3D translation ambiguity and multi-modal condition integration. Moreover, further editing of 3D models in arbitrary views will give users more freedom to customize their models. However, it is difficult to achieve high generation quality, preserve unedited regions, and manage proper interactions between shape components. To solve the above issues, we propose a text-driven 3D content generation and editing method, SketchDream, which supports NeRF generation from given hand-drawn sketches and achieves free-view sketch-based local editing. To tackle the 2D-to-3D ambiguity challenge, we introduce a sketch-based multi-view image generation diffusion model, which leverages depth guidance to establish spatial correspondence. A 3D ControlNet with a 3D attention module is utilized to control multi-view images and ensure their 3D consistency. To support local editing, we further propose a coarse-to-fine editing approach: the coarse phase analyzes component interactions and provides 3D masks to label edited regions, while the fine stage generates realistic results with refined details by local enhancement. Extensive experiments validate that our method generates higher-quality results compared with a combination of 2D ControlNet and image-to-3D generation techniques and achieves detailed control compared with existing diffusion-based 3D editing approaches.
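
The depth-guided correspondence mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera with the principal point at the image centre, a known per-pixel depth map, and a rigid transform between the sketch view and a novel view. The function name `warp_sketch` and all of its parameters are hypothetical. Each stroke pixel is unprojected to 3D using its depth, moved into the target camera frame, and projected back, with no occlusion handling.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def warp_sketch(
    sketch: List[List[int]],
    depth: List[List[float]],
    focal: float,
    rotation: Matrix,
    translation: Tuple[float, float, float],
) -> List[List[int]]:
    """Forward-warp a binary sketch from the source view to a target view.

    Assumes a pinhole camera with the principal point at the image centre
    and a rigid transform X_tgt = R @ X_src + t between camera frames.
    These conventions are illustrative assumptions, not taken from the paper.
    """
    height, width = len(sketch), len(sketch[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    warped = [[0] * width for _ in range(height)]

    for v in range(height):
        for u in range(width):
            if sketch[v][u] == 0:
                continue  # only stroke pixels are carried over
            z = depth[v][u]
            # Unproject pixel (u, v) to a 3D point in the source camera frame.
            p = ((u - cx) * z / focal, (v - cy) * z / focal, z)
            # Move the point into the target camera frame.
            q = [
                sum(rotation[i][k] * p[k] for k in range(3)) + translation[i]
                for i in range(3)
            ]
            if q[2] <= 0:
                continue  # point lies behind the target camera
            # Project back onto the target image plane (nearest pixel).
            u2 = round(focal * q[0] / q[2] + cx)
            v2 = round(focal * q[1] / q[2] + cy)
            if 0 <= u2 < width and 0 <= v2 < height:
                warped[v2][u2] = 1
    return warped
```

With an identity rotation and zero translation the warped sketch equals the input. A sideways translation shifts each stroke by an amount that depends on its depth, which is the cue that ties the sketch view to the novel views.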

Paper Structure (27 sections, 8 equations, 12 figures, 2 tables)

This paper contains 27 sections, 8 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: The overview of our SketchDream for sketch-based generation and editing. Given an input sketch $S$ and a text prompt $y$, we design a sketch-based multi-view diffusion model (a), which takes $S$, depth-warped sketch $S_1$, and white images $S_{\varnothing}$ as conditions and generates multi-view images in the sketch view $c_s$ and novel views $c_{mv}$. In order to generate realistic 3D content (b), we render images under five views corresponding to those in the multi-view diffusion model and then optimize a NeRF by 3D score distillation and 2D interval score matching. For sketch-based 3D editing (c), we design a two-stage editing framework. In the coarse stage, we build a coarse 3D mask and generate a coarse editing result, which is used to get precise 3D masks for high-quality local editing in the fine stage.
  • Figure 2: Sketch-based generation results. Given hand-drawn sketches, our method generates high-quality 3D results, which are faithful to the input sketches and texts. Our method can generate models across diverse categories, including clothes, food, animals, humanoid objects, etc. It can be seen that the shape and pattern details can be controlled by sketches.
  • Figure 3: Sketch-based multi-view image generation results. Given hand-drawn sketches (a) and texts shown above the images, our method generates realistic multi-view images (b), which are faithful to the input sketches and text prompts.
  • Figure 4: Sketch-based generation results with different text prompts. Our method generates diverse and realistic results, whose geometry is controlled by the input sketch while the appearance is controlled by the text prompts.
  • Figure 5: Sketch-based editing results. Given real 3D models, users can select arbitrary views to render images and edit the local regions by inputting texts, modifying sketches (extracted from original render images), and drawing masks. Our method supports the change of local components of real models, such as changing the lion's head orientation (a), opening the treasure chest (b), and changing clothes (c). Our method also supports adding new high-quality components with natural interactions with the original components.
  • ...and 7 more figures
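
The coarse 3D mask mentioned in the Figure 1 caption (c) can be pictured with a minimal sketch. One simple way to lift a 2D mask drawn in the editing view into 3D is to project 3D sample points into that view and keep those that land inside the mask. Whether SketchDream builds its coarse mask this way is not stated in this excerpt; the function name `lift_mask_to_3d`, the pinhole camera model, and the convention that points are given in the editing camera's frame are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def lift_mask_to_3d(
    points: Sequence[Point],
    mask: List[List[int]],
    focal: float,
) -> List[bool]:
    """Label each 3D point as edited if it projects inside the 2D mask.

    Points are assumed to be expressed in the editing camera's frame
    (z pointing forward), with the principal point at the image centre.
    These conventions are illustrative assumptions, not taken from the paper.
    """
    height, width = len(mask), len(mask[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    labels = []
    for x, y, z in points:
        if z <= 0:
            labels.append(False)  # behind the camera, never selected
            continue
        # Project the point to the nearest pixel of the editing view.
        u = round(focal * x / z + cx)
        v = round(focal * y / z + cy)
        inside = 0 <= u < width and 0 <= v < height
        labels.append(inside and mask[v][u] == 1)
    return labels
```

This projection test labels every point along a masked ray regardless of depth, so it over-selects. That limitation suggests why a coarse mask of this kind would need a later refinement step to obtain precise 3D masks.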