Text to Sketch Generation with Multi-Styles

Tengjie Li, Shikui Tu, Lei Xu

TL;DR

This work introduces M3S, a training-free diffusion-based framework for zero-shot sketch synthesis with explicit multi-style control. It achieves this by injecting reference style features into self-attention via a K/V fusion scheme with linear smoothing, and by employing a style-content guidance mechanism along with a joint AdaIN module to regulate style tendency. The method supports single- and multi-style generation, delivering high style fidelity and preserved content while enabling flexible interpolation between styles. Extensive experiments across six sketch datasets demonstrate strong text alignment, style consistency, and competitive human preferences, with SDXL-based variants offering especially robust performance for diverse artistic styles. The approach has practical impact for artists and designers seeking rapid, controllable sketch generation across varied stylistic regimes.

Abstract

Recent advances in vision-language models have facilitated progress in sketch generation. However, existing specialized methods primarily focus on generic synthesis and lack mechanisms for precise control over sketch styles. In this work, we propose a training-free framework based on diffusion models that enables explicit style guidance via textual prompts and referenced style sketches. Unlike previous style transfer methods that overwrite key and value matrices in self-attention, we incorporate the reference features as auxiliary information with linear smoothing and leverage a style-content guidance mechanism. This design effectively reduces content leakage from reference sketches and enhances synthesis quality, especially in cases with low structural similarity between reference and target sketches. Furthermore, we extend our framework to support controllable multi-style generation by integrating features from multiple reference sketches, coordinated via a joint AdaIN module. Extensive experiments demonstrate that our approach achieves high-quality sketch generation with accurate style alignment and improved flexibility in style control. The official implementation of M3S is available at https://github.com/CMACH508/M3S.
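
The abstract contrasts overwriting the key and value matrices with using reference features as auxiliary information under linear smoothing. The NumPy sketch below illustrates one way to read that idea; it is not the official M3S implementation. The function name `fused_self_attention`, the blending weight `lam`, and the exact blending formula are assumptions introduced here for illustration, and the sketch assumes the reference and target token sequences have the same length.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fused_self_attention(q_tar, k_tar, v_tar, k_ref, v_ref, lam=0.5):
    """Self-attention whose keys/values are extended with reference features.

    All inputs have shape (tokens, dim). `lam` is a hypothetical smoothing
    weight: lam=1 uses the raw reference features, lam=0 falls back to the
    target's own features.
    """
    # Assumed form of "linear smoothing": blend reference K/V toward the target's.
    k_aux = lam * k_ref + (1.0 - lam) * k_tar
    v_aux = lam * v_ref + (1.0 - lam) * v_tar
    # Concatenate instead of overwriting, so the target K/V stay available.
    k = np.concatenate([k_tar, k_aux], axis=0)
    v = np.concatenate([v_tar, v_aux], axis=0)
    scores = q_tar @ k.T / np.sqrt(q_tar.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 5, 8))
k_ref, v_ref = rng.normal(size=(2, 5, 8))
out = fused_self_attention(q, k, v, k_ref, v_ref, lam=0.5)
print(out.shape)
```

Because the target keys and values are kept in the concatenation, the target's own structure can still dominate the attention output when the reference sketch is structurally dissimilar, which is one plausible reading of why this design reduces content leakage.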

Paper Structure

This paper contains 33 sections, 10 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Top: Sketches synthesized by our proposed method from a single style exemplar. Bottom: Multi-style sketches generated by our framework. $\eta$ controls the style tendency: as $\eta$ increases, the result's style becomes more aligned with reference style 2, and vice versa.
  • Figure 2: Pipeline of the proposed M3S. Given the reference style sketches $\textbf{I}_{ref_1}$ and $\textbf{I}_{ref_2}$, we invert the two images into the latent space, obtaining latents $\textbf{z}_{t}^{ref_1}$ and $\textbf{z}_{t}^{ref_2}$. The reference $K/V$ features are extracted from these latents and used as auxiliary information in the self-attention layers (see the style-injection section) to generate target images $\textbf{I}_{tar}$. A style-content guidance (see the style-content guidance section) balances fidelity and style consistency, and a joint AdaIN module (see the joint AdaIN section) controls the style tendency. Single-style generation is a special case of this pipeline, obtained by blocking out either the top or the bottom branch.
  • Figure 3: (a) Examples of results generated by different $K/V$ injection methods. Direct $K/V$ substitution and AdaIN constraints (i.e., StyleAligned) introduce visual artifacts (chaotic strokes), whereas our feature concatenation strategy improves line quality. Further incorporating linear blending enhances structural coherence by mitigating content leakage. (b) Counter-based regulation guidance (see the CRG section) suppresses artifacts effectively, with a controlled trade-off in stroke fidelity.
  • Figure 4: Qualitative comparison of different methods. Most evaluation cases are challenging cross-domain synthesis scenarios. The reference images in columns 1-4 are from Styles 1-4, those in columns 5-6 are from Style 5, and those in columns 7-8 are from Style 6.
  • Figure 5: Examples of sketches generated with two reference style images. Top: The same prompt is used within each row; the prompts are listed in the Appendix. The style tendency is set to $\eta=0.5$ in these cases. Bottom: Results for different values of $\eta$, which controls the style tendency.
  • ...and 11 more figures
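
The captions of Figures 1, 2, and 5 describe a joint AdaIN module in which $\eta$ shifts the result toward reference style 2 as it increases. A minimal NumPy sketch of that behavior follows. It assumes the module interpolates the channel-wise mean and standard deviation of the two references with weight $\eta$; that interpolation rule, the function names `channel_stats` and `joint_adain`, and the `(tokens, channels)` layout are illustrative assumptions, not details confirmed by the text above.

```python
import numpy as np

def channel_stats(x, eps=1e-5):
    # Per-channel mean and standard deviation over the token axis.
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True) + eps
    return mean, std

def joint_adain(x_tar, x_ref1, x_ref2, eta=0.5):
    """Re-normalize target features toward a blend of two reference styles.

    Inputs have shape (tokens, channels). eta=0 matches the statistics of
    reference 1, eta=1 matches those of reference 2 (assumed convention,
    chosen to agree with the Figure 1 caption).
    """
    mu_t, sd_t = channel_stats(x_tar)
    mu_1, sd_1 = channel_stats(x_ref1)
    mu_2, sd_2 = channel_stats(x_ref2)
    # Hypothetical joint statistics: linear interpolation between the two styles.
    mu = (1.0 - eta) * mu_1 + eta * mu_2
    sd = (1.0 - eta) * sd_1 + eta * sd_2
    # Standard AdaIN step: whiten with target stats, re-color with joint stats.
    return (x_tar - mu_t) / sd_t * sd + mu

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))
ref1 = rng.normal(loc=2.0, size=(16, 4))
ref2 = rng.normal(loc=-3.0, scale=2.0, size=(16, 4))
for eta in (0.0, 0.5, 1.0):
    print(eta, joint_adain(x, ref1, ref2, eta).mean(axis=0))
```

Sweeping `eta` from 0 to 1 moves the output statistics continuously from those of the first reference to those of the second, which mirrors the style interpolation described in the bottom rows of Figures 1 and 5.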