Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes

Hmrishav Bandyopadhyay, Subhadeep Koley, Ayan Das, Ayan Kumar Bhunia, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song

TL;DR

This paper addresses generating precise 3D shapes from abstract freehand sketches without requiring paired sketch-3D datasets. It introduces a part-aware implicit representation and a latent diffusion model operating in a shared implicit neural representation (INR) latent space, with cross-modal correspondence established by unsupervised part discovery and sketch-to-shape alignment via CLIPasso-derived edgemaps. The same part-level decoder supports both sketch modelling and 3D generation, and enables in-position editing through local part manipulation. The approach achieves efficient, high-fidelity 3D generation from highly abstract inputs, outperforming state-of-the-art sketch-conditioned methods on conditional metrics while generalizing to hand-drawn sketches and multi-view prompts. This paves the way for accessible, editable 3D content creation with reduced data requirements and computation.

Abstract

In this paper, we democratise 3D content creation, enabling precise generation of 3D shapes from abstract sketches while overcoming limitations tied to drawing skills. We introduce a novel part-level modelling and alignment framework that facilitates abstraction modelling and cross-modal correspondence. Leveraging the same part-level decoder, our approach seamlessly extends to sketch modelling by establishing correspondence between CLIPasso edgemaps and projected 3D part regions, eliminating the need for a dataset pairing human sketches and 3D shapes. Additionally, our method introduces a seamless in-position editing process as a byproduct of cross-modal part-aligned modelling. Operating in a low-dimensional implicit space, our approach significantly reduces computational demands and processing time.

Paper Structure (18 sections, 8 equations, 13 figures, 3 tables)

Figures (13)

  • Figure 1: Decomposing shapes into latents, we shuffle the $m$ part indices of each latent $Z \in \mathbb{R}^{m \times d}$ to minimise the Wasserstein distance [kantorovich1960mathematical] to the corresponding parts in the template latent $Z_t \in \mathbb{R}^{m \times d}$.
  • Figure 2: Part-level segmentation maps are created by segmenting 3D shapes into parts with part Gaussians and rendering individual 3D shape parts on synthetic sketches. This segments sketch regions based on shape parts of their corresponding 3D shapes.
  • Figure 3: Model overview: (a) The diffusion pipeline denoises latent vector $z_t \in \mathbb{R}^{m \times d}$ to $z_{t-1}$ with fully connected layers $f_d$ and a multi-head attention module $\mathcal{C}$ at time step $t$. After $t=0$, the fully denoised vector $z_0$ corresponds to a generated part-latent $Z$. (b) We encode sketches as part-disentangled representations with encoder $f_s$ by segmenting them into segment maps of individual parts with shared decoder $f_s'$. These sketch representations are fed to the attention module $\mathcal{C}$ as a Query with intermediate diffusion outputs (from $f_d$) as Key-Value pairs.
  • Figure 4: From left to right: 3D shape; an edgemap of a 2D render of the shape; corresponding sketches from the ProSketch-3D [zhong2020towards] and AmateurSketch-3D [qi2021toward] datasets; an abstract CLIPasso [vinker2022clipasso] sketch of the shape.
  • Figure 5: Qualitative comparisons of our method with LAS-D [zheng2023lasdiff] and SENS [binninger2023sens] on sketches of different levels of abstraction, from (i) highly detailed sketches by artists (first 2, from ProSketch-3D [zhong2020towards]) and (ii) sketches by amateurs with perspective distortions (next 6, from AmateurSketch-3D [qi2021toward]) to (iii) highly abstract sketches drawn in $<20$ s (last, from Quick-Draw! [ha2017neural]). While neither LAS-D nor our algorithm has seen hand-drawn doodles during training, SENS [binninger2023sens] was trained on ProSketch-3D sketches.
  • ...and 8 more figures
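
The part-index shuffling described in the Figure 1 caption can be illustrated with a small sketch. It is not the authors' implementation: it assumes each latent is a set of $m$ equally weighted part vectors and uses squared Euclidean distance as the ground cost, in which case minimising the Wasserstein distance to the template reduces to finding the best one-to-one assignment of parts. The function names `align_parts` and `reorder`, and the toy values, are hypothetical.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two part vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def align_parts(Z, Z_t):
    """Find the permutation of Z's rows that best matches the template Z_t.

    Assumption: parts are equally weighted, so the optimal transport plan
    is a permutation. Returns (perm, cost), where perm[i] is the row of Z
    assigned to template part i. Brute force, so only suitable for small m.
    """
    m = len(Z_t)
    if len(Z) != m:
        raise ValueError("Z and Z_t must have the same number of parts")
    # cost[i][j]: cost of assigning row j of Z to template part i
    cost = [[sq_dist(Z[j], Z_t[i]) for j in range(m)] for i in range(m)]
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(m)):
        total = sum(cost[i][perm[i]] for i in range(m))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost


def reorder(Z, perm):
    """Shuffle the part indices of Z so that row i corresponds to template part i."""
    return [Z[j] for j in perm]


# Toy example with m = 3 parts and d = 2 dimensions (values are made up).
Z_t = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # template latent
Z = [[0.1, 0.9], [0.0, 0.1], [0.9, 0.0]]    # same parts, different order
perm, cost = align_parts(Z, Z_t)
Z_aligned = reorder(Z, perm)
```

The brute-force search is exponential in $m$ and is used here only to keep the sketch dependency-free; for larger $m$, an assignment solver such as `scipy.optimize.linear_sum_assignment` would be a more practical choice.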