MixSA: Training-free Reference-based Sketch Extraction via Mixture-of-Self-Attention

Rui Yang; Xiaojun Wu; Shengfeng He

TL;DR

This work tackles the challenge of producing high-quality sketches from color images without training, for arbitrary reference styles. It introduces Mixture-of-Self-Attention (MixSA), which injects brushstroke features from a reference sketch into the late decoder layers of a latent diffusion model, while a Decomposing Contours and Texture (DCT) module separates texture from contours. Two parameters control the result: $\beta$ sets how much texture is included, and $\zeta$ sets how closely the brushstrokes follow the reference. Together they let MixSA interpolate between styles and mitigate the color-averaging artifacts common in diffusion models. Experiments on multiple datasets show higher sketch fidelity and flexibility, and users prefer its outputs, supporting the training-free approach and its practical use for art and portrait sketching. The approach gives artists and developers a tool for style-consistent sketch extraction without costly retraining or data collection.

Abstract

Current sketch extraction methods either require extensive training or fail to capture a wide range of artistic styles, limiting their practical applicability and versatility. We introduce Mixture-of-Self-Attention (MixSA), a training-free sketch extraction method that leverages strong diffusion priors for enhanced sketch perception. At its core, MixSA employs a mixture-of-self-attention technique, which manipulates self-attention layers by substituting the keys and values with those from reference sketches. This allows for the seamless integration of brushstroke elements into initial outline images, offering precise control over texture density and enabling interpolation between styles to create novel, unseen styles. By aligning brushstroke styles with the texture and contours of colored images, particularly in late decoder layers handling local textures, MixSA addresses the common issue of color averaging by adjusting initial outlines. Evaluated with various perceptual metrics, MixSA demonstrates superior performance in sketch quality, flexibility, and applicability. This approach not only overcomes the limitations of existing methods but also empowers users to generate diverse, high-fidelity sketches that more accurately reflect a wide range of artistic expressions.
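The key-and-value substitution described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain Python lists instead of U-Net feature maps, and the names `attention` and `mixed_self_attention` are hypothetical. It only shows the core idea that queries come from the outline while keys and values come from the reference sketch, so every output token is a weighted average of reference features.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim_out = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(dim_out)])
    return out


def mixed_self_attention(q_outline, k_ref, v_ref):
    """Assumed form of the substitution: outline queries attend to
    reference keys and values instead of the outline's own."""
    return attention(q_outline, k_ref, v_ref)


if __name__ == "__main__":
    q_outline = [[1.0, 0.0], [0.0, 1.0]]          # two outline tokens
    k_ref = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three reference tokens
    v_ref = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    for token in mixed_self_attention(q_outline, k_ref, v_ref):
        print(token)
```

Because the attention weights are non-negative and sum to one, each output stays inside the range spanned by the reference values. That is one way to see why the output inherits the reference's brushstroke features while the outline queries decide where they are placed.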

Paper Structure (33 sections, 11 equations, 25 figures, 4 tables)

This paper contains 33 sections, 11 equations, 25 figures, 4 tables.

Introduction
Related work
Sketch Extraction
Style Transfer
Preliminaries
Latent Diffusion Models
DDIM Inversion
Attention Mechanism in Stable Diffusion
Method
Mixture of Self-Attention
Decomposing Contours and Texture (DCT) Module
Sketch Style Interpolation
Target Style 1
Target Style 2
More Flexible Sketch Extraction
...and 18 more sections

Figures (25)

Figure 1: We propose MixSA, a training-free approach for extracting sketches from a color image using an input reference style image. Our model not only faithfully captures the input styles (left) but also allows interpolation between two styles to generate novel, unseen styles (right), exemplified by the Xieyi (Freehand) and Gongbi (Fine) styles.
Figure 2: Data-driven sketch extraction methods (a) struggle to adapt to unseen reference styles, while diffusion-based style transfer methods (b) fail to disentangle overall styles from sketch-specific styles, resulting in inconsistent sketch transfers. Our MixSA overcomes both the extensive training requirements and the sketch style transfer limitations.
Figure 3: The architecture of our proposed MixSA begins with a color image $z^c_0$ and its initial outlines $z^s_0$, both converted to latent representations via DDIM inversion. A reference sketch $z^r_0$ undergoes the same processing. The self-attention features ($Q_t$) from these latent representations are manipulated using the mixture-of-self-attention (MSA) and Decomposing Contours and Texture (DCT) modules. These modified features are injected into the denoising U-Net, integrating the reference sketch's key and value features ($K_t^r$ and $V_t^r$) into the decoder. The final output $O$ is a high-fidelity sketch that faithfully matches the reference style.
Figure 4: Detailed mechanism of the mixture-of-self-attention module. The reference image's self-attention features ($Q_t^r, K_t^r, V_t^r$) and the initial sketch's self-attention features ($Q_t^m$) are computed, with $Q_t^m$ being a fusion of the color image and its initial outlines. The reference's key and value features ($K_t^r$ and $V_t^r$) replace those of the sketch in the self-attention mechanism. This ensures that the generated sketch incorporates the brushstroke styles of the reference image while maintaining the structure and layout of the initial outlines.
Figure 5: Illustration of $\zeta$ as the control axis for the degree of freehand (Xieyi) style. In the first row, with $\beta = 0$ (no texture), the scene's sketch is automatically extracted based on object contours. From left to right, the sketch becomes increasingly freehand as $\zeta$ increases, with brush stroke styles approaching those of the reference sketch. In the second row, $\beta$ decreases from left to right, reducing texture inclusion. Both object contours and textures influence the sketch extraction. As the degree of freehand style increases, the sketch transitions from detailed to more abstract representations.
...and 20 more figures
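
The captions for Figures 1, 4, and 5 describe two kinds of blending: the query $Q_t^m$ is a fusion of the color image and its initial outlines, and two reference styles can be interpolated into a new one. The captions do not give the formulas, so the sketch below assumes a simple linear blend for both. The function names, the weight `alpha`, and the linear rule are illustrative assumptions, not the paper's definitions. The only behavior taken from the captions is that $\beta = 0$ means no texture, so the fused query then reduces to the outline query. The $\zeta$ parameter is not modeled here.

```python
def lerp(a, b, w):
    """Element-wise (1 - w) * a + w * b for two equally shaped lists of vectors."""
    return [[(1.0 - w) * x + w * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def fuse_queries(q_outline, q_color, beta):
    """Assumed fusion rule: beta = 0 keeps only the outline query (no texture);
    larger beta mixes in more of the color image's query."""
    return lerp(q_outline, q_color, beta)


def interpolate_styles(out_style_1, out_style_2, alpha):
    """Assumed interpolation: blend the attention outputs obtained from two
    different reference sketches with a hypothetical weight alpha in [0, 1]."""
    return lerp(out_style_1, out_style_2, alpha)


if __name__ == "__main__":
    q_outline = [[0.0, 1.0]]
    q_color = [[1.0, 0.0]]
    print(fuse_queries(q_outline, q_color, beta=0.0))   # outline only
    print(fuse_queries(q_outline, q_color, beta=0.5))   # equal mix

    xieyi_out = [[0.0, 4.0]]    # placeholder features for one reference style
    gongbi_out = [[4.0, 0.0]]   # placeholder features for another
    print(interpolate_styles(xieyi_out, gongbi_out, alpha=0.25))
```

In a real pipeline these blends would act on attention features inside the denoising U-Net at each timestep; the lists here stand in for those tensors only to show the shape of the computation.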
