StrokeFusion: Vector Sketch Generation via Joint Stroke-UDF Encoding and Latent Sequence Diffusion

Jin Zhou, Yi Zhou, Hongliang Yang, Pengfei Xu, Hui Huang

TL;DR

StrokeFusion introduces a dual-modal stroke-UDF encoding to capture both geometric and raster-like cues for vector sketches, followed by a stroke-level latent diffusion model that generates unordered, variable-length stroke sets. The two-stage design decouples layout prediction from shape synthesis and enables permutation-invariant diffusion in latent space, yielding high-fidelity, editable vector sketches across diverse domains. Quantitative results on QuickDraw and other challenging datasets show consistent improvements in FID, precision, and recall, particularly for complex, multi-stroke sketches, while qualitative analyses highlight improved structural coherence and detail. The approach offers practical benefits for vector sketch creation and editing in design tools, enabling robust cross-domain generalization and controllable stroke-level generation.

Abstract

In the field of sketch generation, raster-format trained models often produce non-stroke artifacts, while vector-format trained models typically lack a holistic understanding of sketches, leading to compromised recognizability. Moreover, existing methods struggle to extract common features from similar elements (e.g., eyes of animals) appearing at varying positions across sketches. To address these challenges, we propose StrokeFusion, a two-stage framework for vector sketch generation. It contains a dual-modal sketch feature learning network that maps strokes into a high-quality latent space. This network decomposes sketches into normalized strokes and jointly encodes stroke sequences with Unsigned Distance Function (UDF) maps, representing sketches as sets of stroke feature vectors. Building upon this representation, our framework exploits a stroke-level latent diffusion model that simultaneously adjusts stroke position, scale, and trajectory during generation. This enables high-fidelity sketch generation while supporting stroke interpolation editing. Extensive experiments on the QuickDraw dataset demonstrate that our framework outperforms state-of-the-art techniques, validating its effectiveness in preserving structural integrity and semantic features. Code and models will be made publicly available upon publication.
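
The stroke normalization and UDF map mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes each stroke is a polyline of (x, y) points, that the bounding box is stored as (x, y, w, h) with (x, y) the top-left corner, and that the UDF is sampled at pixel centers of a small square grid. The function names `normalize_stroke`, `point_segment_distance`, and `udf_map` are hypothetical.

```python
import math


def normalize_stroke(points):
    """Map a polyline into the unit square.

    Returns the normalized points and the bounding box (x, y, w, h),
    where (x, y) is assumed to be the top-left corner.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    # Guard against zero extent (perfectly horizontal or vertical strokes).
    w = max(max(xs) - x0, 1e-6)
    h = max(max(ys) - y0, 1e-6)
    norm = [((x - x0) / w, (y - y0) / h) for x, y in points]
    return norm, (x0, y0, w, h)


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def udf_map(norm_points, resolution=8):
    """Unsigned distance from each pixel center to the normalized polyline."""
    segments = list(zip(norm_points[:-1], norm_points[1:]))
    if not segments:  # single-point stroke
        segments = [(norm_points[0], norm_points[0])]
    grid = []
    for row in range(resolution):
        cy = (row + 0.5) / resolution
        grid.append([
            min(point_segment_distance(((col + 0.5) / resolution, cy), a, b)
                for a, b in segments)
            for col in range(resolution)
        ])
    return grid


# Usage: an L-shaped stroke in pixel coordinates.
stroke = [(10, 20), (30, 20), (30, 60)]
norm, bbox = normalize_stroke(stroke)
udf = udf_map(norm, resolution=4)
```

Because every stroke is mapped to the same unit square before the distance field is computed, similar shapes at different positions and scales yield similar inputs. That is the property the abstract relies on for extracting common features from repeated elements.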

Paper Structure

This paper contains 45 sections, 17 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: The proposed StrokeFusion framework comprises two core components: 1) Dual-Modal Stroke Encoding: Each stroke $s$ is processed through parallel encoding paths: a transformer-based sequence encoder handles geometric coordinates, while a CNN processes the stroke distance field $I_n$. These modalities are fused into joint features $f$, trained via symmetric decoder networks that reconstruct both the original stroke ($s$) and the distance field ($I_n$); 2) Sketch Diffusion Generation: All normalized strokes are encoded into latent vectors $z_i$, augmented with bounding box parameters $b^i = [x^i, y^i, w^i, h^i]$ and presence flags $v^i \in \{-1,1\}$. The diffusion model learns the distribution of stroke sequences $\{\mathbf{z}_1, ..., \mathbf{z}_N\}$ through $T$-step denoising training. During generation, the denoiser progressively refines noisy latents via reverse diffusion, with valid strokes ($v^i=1$) being decoded through inverse normalization of $\hat{b}^i$ to reconstruct the final sketch. The architecture maintains permutation invariance through order-agnostic sequence processing.
  • Figure 2: Two examples of the generation process in StrokeFusion. From left to right, the noise level progressively decreases. At each timestep, only strokes with presence confidence $\hat{v}_i > 0$ are visualized.
  • Figure 3: Qualitative comparison of sketches generated by our method and the baselines across different categories in QuickDraw. Our method consistently produces more structurally coherent sketches with richer local details, particularly in complex, multi-stroke scenarios.
  • Figure 4: Qualitative generation results on several more complex datasets. Dataset names and representative examples are shown below each column for comparison.
  • Figure 5: Comparison with DoodleFormer. For each dataset we select visually similar samples from both methods.
  • ...and 6 more figures
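
The token layout and decoding step described in the captions of Figures 1 and 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code. It assumes each stroke token is a tuple (latent, bounding box, presence flag), that padding tokens carry a flag of -1, that (x, y) in the bounding box is the top-left corner, and that `decode_stroke` stands in for the learned stroke decoder. The names `pad_stroke_set`, `decode_sketch`, and `toy_decoder` are hypothetical.

```python
def pad_stroke_set(strokes, max_strokes, latent_dim):
    """Build a fixed-length token list from a variable-length stroke set.

    strokes: list of (z, bbox) pairs, with bbox = (x, y, w, h).
    Real strokes get presence flag +1; padding slots get -1.
    """
    tokens = [(list(z), tuple(b), 1.0) for z, b in strokes[:max_strokes]]
    padding = ([0.0] * latent_dim, (0.0, 0.0, 0.0, 0.0), -1.0)
    tokens += [padding] * (max_strokes - len(tokens))
    return tokens


def decode_sketch(tokens, decode_stroke):
    """Keep tokens with presence > 0, decode them, and undo normalization."""
    sketch = []
    for z, (x, y, w, h), v in tokens:
        if v <= 0:
            continue  # slot predicted as empty
        norm_points = decode_stroke(z)  # points in the unit square
        # Inverse normalization: scale by (w, h), then translate by (x, y).
        sketch.append([(x + u * w, y + t * h) for u, t in norm_points])
    return sketch


def toy_decoder(z):
    """Stand-in for the learned decoder: reads two points out of the latent."""
    return [(z[0], z[1]), (z[2], z[3])]


# Usage: one real stroke padded to three slots, then decoded back.
strokes = [([0, 0, 1, 1], (10, 20, 20, 40))]
tokens = pad_stroke_set(strokes, max_strokes=3, latent_dim=4)
sketch = decode_sketch(tokens, toy_decoder)
```

Since `decode_sketch` treats each token independently and filters on the presence flag alone, shuffling the token list changes only the order of the output strokes, not their content. This matches the unordered, variable-length stroke sets described in the TL;DR.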