SwiftSketch: A Diffusion Model for Image-to-Vector Sketch Generation

Ellie Arar, Yarden Frenkel, Daniel Cohen-Or, Ariel Shamir, Yael Vinker

TL;DR

SwiftSketch tackles image-to-vector sketch generation by learning a diffusion process over stroke coordinates conditioned on image features, enabling sub-second inference with as few as $T=50$ denoising steps. It introduces ControlSketch, which uses a depth-conditioned ControlNet to synthesize a large paired image-sketch dataset. That dataset is used to train SwiftSketch, whose transformer-decoder architecture is designed to handle the discrete nature of vector data. Results show that SwiftSketch generalizes to diverse concepts, producing high-fidelity, natural-looking vector sketches while dramatically reducing generation time compared to optimization-based baselines. The work enables real-time, editable vector sketch generation and provides a scalable data-generation pipeline that can support broader research.

Abstract

Recent advancements in large vision-language models have enabled highly expressive and diverse vector sketch generation. However, state-of-the-art methods rely on a time-consuming optimization process involving repeated feedback from a pretrained model to determine stroke placement. Consequently, despite producing impressive sketches, these methods are limited in practical applications. In this work, we introduce SwiftSketch, a diffusion model for image-conditioned vector sketch generation that can produce high-quality sketches in less than a second. SwiftSketch operates by progressively denoising stroke control points sampled from a Gaussian distribution. Its transformer-decoder architecture is designed to effectively handle the discrete nature of vector representation and capture the inherent global dependencies between strokes. To train SwiftSketch, we construct a synthetic dataset of image-sketch pairs, addressing the limitations of existing sketch datasets, which are often created by non-artists and lack professional quality. For generating these synthetic sketches, we introduce ControlSketch, a method that enhances SDS-based techniques by incorporating precise spatial control through a depth-aware ControlNet. We demonstrate that SwiftSketch generalizes across diverse concepts, efficiently producing sketches that combine high fidelity with a natural and visually appealing style.

Paper Structure

This paper contains 31 sections, 6 equations, 41 figures, 3 tables.

Figures (41)

  • Figure 1: Amateur vs. Professional Sketches. (a) QuickDraw [SketchRNN] and (b) Sketchy [Sangkloy2016TheSD] are large-scale datasets, with Sketchy offering more fine-grained sketches, though both exhibit a non-professional style. (c) OpenSketch [Gryaditskaya2019OpenSketch] and (d) Berger et al. [Berger2013StyleAA] contain professional sketches but are limited in scale and focus on specific domains.
  • Figure 2: ControlSketch Pipeline. Left: The object area is divided into $k$ regions (c), with $n$ points distributed based on attention values from (b) while ensuring a minimum allocation per region. (d) The initial strokes are derived from these points. Right: The initial strokes are iteratively optimized to form the sketch. At each iteration, the rasterized sketch is noised based on $t$ and $\epsilon$ and fed into a diffusion model with a depth ControlNet conditioned on the image's depth and caption $y$. The predicted noise $\hat{\epsilon}$ is used for the SDS loss.
  • Figure 3: (a) Input image. (b) Object mask. (c) The object's contour is extracted from the mask using morphological operations, and sketch pixels that intersect with the contour are given higher weight. (d) Attention map. (e) We sort the strokes based on a combination of contour intersection count and attention score. (f) A visualization of the first 16 strokes in the ordered sketch, demonstrating the effectiveness of our sorting scheme.
  • Figure 4: SwiftSketch Training Pipeline. At each training iteration, an image $I$ is passed through a frozen CLIP image encoder, followed by a lightweight CNN, to produce the image embedding $I_e$. The corresponding vector sketch $S^0$ is noised based on the sampled timestep $t$ and noise $\epsilon$, forming $S^t$ (with $\mathcal{R}(S^t)$ illustrating the rasterized noised sketch, which is not used in training). The network $M_\theta$, a transformer decoder, receives the noised signal $S^t$ and is tasked with predicting the clean signal $\hat{S}^0$, conditioned on the image embedding $I_e$ and the timestep $t$ (fed through the cross-attention mechanism). The network is trained with two loss functions: one based on the distance between the control points and the other on the similarity of the rasterized sketches.
  • Figure 5: Inference Process. Starting with randomly sampled Gaussian noise $S^T \sim \mathcal{N}(0, \mathbf{I})$, the model $M_{\theta}$ predicts the clean sketch $\hat{S}^0 = M_{\theta}(S^t, t, I_e)$ at each step $t$, which is then re-noised to $S^{t-1}$. This iterative process is repeated for $T$ steps and is followed by a final feed-forward pass through a refinement network, $M_{\theta^*}$, which is a trainable copy of $M_{\theta}$, specifically trained to correct very small residual noise.
  • ...and 36 more figures
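
The inference process described in the Figure 5 caption (start from Gaussian noise, predict the clean sketch, re-noise it to the previous step, then apply a refinement pass) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation. The linear noise schedule, the function names (`make_alpha_bar`, `q_sample`, `sample_sketch`), the call signature assumed for the refinement network, and the re-noising rule (applying the forward process to the predicted clean sketch) are all assumptions made for the example.

```python
import numpy as np

def make_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear beta schedule.
    The schedule itself is an assumption; the source does not specify one."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(s0, t, alpha_bar, rng):
    """Noise a clean sketch s0 to step t (1-indexed). t == 0 returns s0 unchanged."""
    if t == 0:
        return s0
    a = alpha_bar[t - 1]
    eps = rng.standard_normal(s0.shape)
    return np.sqrt(a) * s0 + np.sqrt(1.0 - a) * eps

def sample_sketch(model, refine, image_emb, shape, T, rng):
    """Hypothetical sampling loop following the Figure 5 description.

    model(s_t, t, image_emb)  -> predicted clean sketch (stands in for M_theta)
    refine(s, t, image_emb)   -> refined sketch (stands in for M_theta*;
                                 its signature is assumed to match model's)
    """
    alpha_bar = make_alpha_bar(T)
    s_t = rng.standard_normal(shape)            # S^T ~ N(0, I)
    for t in range(T, 0, -1):
        s0_hat = model(s_t, t, image_emb)       # predict the clean sketch
        s_t = q_sample(s0_hat, t - 1, alpha_bar, rng)  # re-noise to S^{t-1}
    return refine(s_t, 0, image_emb)            # final feed-forward refinement

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Dummy stand-ins: a "model" that always predicts the same sketch,
    # and an identity refinement network. The (32, 8) shape is illustrative.
    target = np.full((32, 8), 0.5)
    sketch = sample_sketch(lambda s, t, e: target, lambda s, t, e: s,
                           image_emb=None, shape=(32, 8), T=50, rng=rng)
    print(sketch.shape)
```

Because the model predicts the clean signal at every step rather than the noise, the last iteration (re-noising to step 0) leaves the prediction untouched, and only the refinement pass modifies it afterwards.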