DiffSketcher: Text Guided Vector Sketch Synthesis through Latent Diffusion Models

Ximing Xing, Chuang Wang, Haitao Zhou, Jing Zhang, Qian Yu, Dong Xu

TL;DR

DiffSketcher tackles the problem of generating free-hand vector sketches from text by leveraging a pretrained latent diffusion prior and a differentiable rasterizer to optimize Bézier strokes. The approach extends Score Distillation Sampling (SDS) to guide vector-graphic parameters, and introduces an attention-guided stroke initialization plus an opacity-aware, augmentation-based loss to capture brush-like quality while preserving semantic fidelity. Key contributions include (1) a text-to-sketch diffusion framework for object- and scene-level vector sketches, (2) three quality-enhancing strategies (extended SDS, attention-based initialization, and stroke opacity), and (3) comprehensive experiments showing improved semantic alignment, aesthetics, and recognizability over prior methods. The method enables efficient, scalable text-driven vector sketching with potential applications in design and education, and highlights future directions for better abstractness control and style transfer.
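
To make "optimizing Bézier strokes" concrete, here is a minimal pure-Python sketch of how one stroke's control points determine the curve that a rasterizer would draw. It assumes cubic curves with four control points, which this summary does not specify, and the helper names (`cubic_bezier`, `sample_stroke`) are hypothetical. It is not the differentiable rasterizer itself, only the parameterization it would consume.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier curve at t in [0, 1] using the Bernstein basis."""
    u = 1.0 - t
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)

def sample_stroke(control_points: List[Point], n: int = 8) -> List[Point]:
    """Turn one stroke (4 control points, assumed cubic) into an n-point polyline."""
    p0, p1, p2, p3 = control_points
    return [cubic_bezier(p0, p1, p2, p3, i / (n - 1)) for i in range(n)]

# One arch-shaped stroke in normalized canvas coordinates (illustrative values).
stroke = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
polyline = sample_stroke(stroke, n=5)
```

Because every point on the polyline is a smooth function of the control points, a loss computed on the rendered stroke can in principle be pushed back to those control points by gradient descent.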

Abstract

We find that pretrained diffusion models, although trained mainly on images, show impressive power in guiding sketch synthesis. In this paper, we present DiffSketcher, an algorithm that creates vectorized free-hand sketches from natural language input. DiffSketcher builds on a pretrained text-to-image diffusion model. It directly optimizes a set of Bézier curves with an extended version of the score distillation sampling (SDS) loss, which allows a raster-level diffusion model to serve as a prior for optimizing a parametric vectorized sketch generator. Furthermore, we explore the attention maps embedded in the diffusion model for effective stroke initialization, which speeds up the generation process. The generated sketches demonstrate multiple levels of abstraction while maintaining the recognizability, underlying structure, and essential visual details of the subject drawn. Our experiments show that DiffSketcher achieves higher quality than prior work. The code and demo of DiffSketcher can be found at https://ximinng.github.io/DiffSketcher-project/.
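
The idea of using a raster-level diffusion model as a prior can be illustrated with a toy, pure-Python sketch of a score-distillation-style update in the general form popularized by DreamFusion: noise the rendered image, ask a frozen noise predictor for its estimate, and use the weighted residual (predicted noise minus injected noise) as the gradient on the parameters, without differentiating through the predictor. Everything here is an assumption made for illustration: the "rasterizer" is the identity, the "diffusion model" is an analytic denoiser whose data distribution is a single target image, and the noise schedule and weighting `w` are arbitrary. It is not the paper's extended SDS loss.

```python
import random

def toy_noise_predictor(x_t, alpha, sigma, target):
    """Stand-in for a frozen denoiser (hypothetical).

    If the 'data' is the single point `target`, then x_t = alpha*target + sigma*eps,
    so the exact noise estimate is (x_t - alpha*target) / sigma.
    """
    return [(v - alpha * g) / sigma for v, g in zip(x_t, target)]

def sds_step(theta, target, lr, rng):
    """One score-distillation-style update on the parameters `theta`."""
    t = rng.uniform(0.02, 0.98)                  # random noise level
    alpha, sigma = (1.0 - t) ** 0.5, t ** 0.5    # toy schedule: alpha^2 + sigma^2 = 1
    eps = [rng.gauss(0.0, 1.0) for _ in theta]   # injected noise
    image = list(theta)                          # identity "rasterizer": d(image)/d(theta) = 1
    x_t = [alpha * v + sigma * e for v, e in zip(image, eps)]
    eps_hat = toy_noise_predictor(x_t, alpha, sigma, target)
    w = sigma ** 2                               # weighting w(t), an arbitrary choice here
    # The residual is used directly as the gradient; nothing is
    # backpropagated through the noise predictor.
    grad = [w * (eh - e) for eh, e in zip(eps_hat, eps)]
    return [v - lr * g for v, g in zip(theta, grad)]

rng = random.Random(0)
target = [0.2, 0.8, 0.5]   # the single "image" the toy prior prefers
theta = [0.0, 0.0, 0.0]    # parameters being optimized
for _ in range(300):
    theta = sds_step(theta, target, lr=0.5, rng=rng)
```

In this toy setting the residual reduces to `alpha * (theta - target) / sigma`, so each step pulls the parameters toward the image the prior prefers. With a real rasterizer, the same residual would additionally be multiplied by the derivative of the rendered image with respect to the stroke parameters.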

Paper Structure (26 sections, 5 equations, 17 figures, 1 table)

This paper contains 26 sections, 5 equations, 17 figures, 1 table.

Figures (17)

  • Figure 1: Top: Visualizations of the vector sketches generated by our proposed method, DiffSketcher. Bottom: Visualizations of the drawing process. For each example, we show two sketches with a different number of strokes.
  • Figure 2: Various free-hand sketches synthesized by DiffSketcher and the corresponding description prompts. DiffSketcher obtains prior information from LDM-synthesized images (Rombach et al., 2022) through score distillation (Poole et al., 2023) and achieves the same heavy and light drawing styles as human sketches by performing gradient descent on a set of Bézier curves with an opacity property. DiffSketcher allows for varying levels of abstraction while matching the semantics of the text. In each example, given the same text prompt and two different random seeds, two sketches with different numbers of strokes are generated. The red words mark the cross-attention index used to initialize the control points (details about cross-attention are covered in the stroke initialization section).
  • Figure 3: The overview of the pipeline. DiffSketcher accepts a set of control points (the locations of the strokes) and text prompts as input to generate a hand-drawn sketch.
  • Figure 4: Optimization overview. To synthesize a sketch that matches the given text prompt, we optimize the parameters of the differentiable rasterizer $\mathcal{R}$ that produces the raster sketch $\mathcal{S}$, such that the resulting sketch is close to a sample from the frozen latent diffusion model (the blue part of the picture). Since the diffusion model directly predicts the update direction, we do not need to backpropagate through the diffusion model; the model simply acts like an efficient, frozen critic that predicts image-space edits.
  • Figure 5: Stroke initialization. The blue part of the figure represents the UNet in the LDM, which has two types of attention mechanisms: self-attention and cross-attention. The yellow and green parts depict visualizations of the cross-attention and self-attention maps, respectively. The gray part shows how the initial strokes are generated from a fused attention map. The dashed box represents the attention fusion, which takes the mean of the self-attention map and the cross-attention map corresponding to the 5th text prompt token ("Tower"). Token indices start at 1 because index 0 is occupied by the CLIP start token.
  • ...and 12 more figures
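
The attention fusion described in the Figure 5 caption can be sketched in a few lines of pure Python. The caption only states that the fused map is the mean of a self-attention map and a cross-attention map. Normalizing each map to sum to 1 and then sampling stroke positions in proportion to the fused values are assumptions added here for illustration, and the function names are hypothetical.

```python
import random

def normalize(attn):
    """Scale a 2D map (list of rows) so that its entries sum to 1."""
    total = sum(sum(row) for row in attn)
    return [[v / total for v in row] for row in attn]

def fuse_attention(self_attn, cross_attn):
    """Element-wise mean of the two normalized maps (the fusion named in the caption)."""
    s, c = normalize(self_attn), normalize(cross_attn)
    return [[0.5 * (a + b) for a, b in zip(rs, rc)] for rs, rc in zip(s, c)]

def sample_points(fused, k, rng):
    """Draw k (row, col) cells with probability proportional to the fused map.

    Sampling in proportion to attention is an assumption of this sketch.
    """
    cells = [(r, c) for r in range(len(fused)) for c in range(len(fused[0]))]
    weights = [fused[r][c] for r, c in cells]
    return rng.choices(cells, weights=weights, k=k)

# Tiny 3x3 maps standing in for UNet attention (illustrative values only).
self_attn = [[0, 0, 0], [0, 2, 2], [0, 0, 0]]
cross_attn = [[0, 0, 0], [0, 1, 3], [0, 0, 0]]  # e.g. the map for one prompt token
fused = fuse_attention(self_attn, cross_attn)
points = sample_points(fused, k=6, rng=random.Random(0))
```

Cells where both maps are zero receive zero weight, so the sampled starting points concentrate on the region the prompt token attends to. That concentration is how attention-guided initialization would speed up the optimization.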