Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer

Jiwoo Chung, Sangeek Hyun, Jae-Pil Heo

TL;DR

This work introduces a novel artistic style transfer method built on a pre-trained large-scale diffusion model. The method requires no optimization and surpasses state-of-the-art conventional and diffusion-based style transfer baselines.

Abstract

Despite the impressive generative capabilities of diffusion models, existing diffusion model-based style transfer methods either require inference-stage optimization (e.g., fine-tuning or textual inversion of style), which is time-consuming, or fail to leverage the generative ability of large-scale diffusion models. To address these issues, we introduce a novel artistic style transfer method based on a pre-trained large-scale diffusion model without any optimization. Specifically, we manipulate the features of self-attention layers in the way the cross-attention mechanism works: during the generation process, we substitute the key and value of the content with those of the style image. This approach provides several desirable characteristics for style transfer, including 1) preservation of content by transferring similar styles into similar image patches and 2) transfer of style based on the similarity of local texture (e.g., edges) between content and style images. Furthermore, we introduce query preservation and attention temperature scaling to mitigate the disruption of the original content, and initial latent Adaptive Instance Normalization (AdaIN) to deal with disharmonious color (failure to transfer the colors of the style). Our experimental results demonstrate that the proposed method surpasses state-of-the-art methods among both conventional and diffusion-based style transfer baselines.
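
The attention manipulation described in the abstract can be sketched for a single attention head as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `style_injected_attention`, the parameter names `gamma` (query blend ratio) and `tau` (temperature factor), and their default values are hypothetical, and the linear query blend and the multiplication of pre-softmax logits by `tau` are one plausible reading of "query preservation" and "attention temperature scaling".

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def style_injected_attention(q_cs, q_c, k_s, v_s, gamma=0.75, tau=1.5):
    """Single-head attention with style keys/values (illustrative sketch).

    q_cs: queries of the stylized image, shape (n_content_tokens, d)
    q_c:  queries of the content image,  shape (n_content_tokens, d)
    k_s, v_s: keys/values of the style image, shape (n_style_tokens, d)
    gamma, tau: hypothetical names and values chosen for this example.
    """
    d = q_cs.shape[-1]
    # Query preservation (assumed form): blend content and stylized queries.
    q = gamma * q_c + (1.0 - gamma) * q_cs
    # Key/value substitution: attend to the style image instead of the content.
    logits = (q @ k_s.T) / np.sqrt(d)
    # Temperature scaling (assumed form): tau > 1 sharpens the attention map.
    weights = softmax(tau * logits, axis=-1)
    return weights @ v_s, weights

rng = np.random.default_rng(0)
q_cs, q_c = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
k_s, v_s = rng.normal(size=(10, 8)), rng.normal(size=(10, 8))
out, weights = style_injected_attention(q_cs, q_c, k_s, v_s)
print(out.shape, weights.shape)
```

Because the keys and values come from the style image, every output token is a weighted mixture of style features, while the queries decide which style tokens each content location draws from.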

Paper Structure (18 sections, 10 equations, 19 figures, 6 tables)

This paper contains 18 sections, 10 equations, 19 figures, 6 tables.

Figures (19)

  • Figure 1: Manipulation of self-attention features for style transfer. (a) General self-attention (SA) uses the query, key, and value features from a single image in both the training and inference phases. (b) At inference, we propose manipulating the self-attention features of a pre-trained large-scale DM as an effective way to transfer styles: the key and value of the style are injected into the SA of the content. As a result, the style-injected content $z^c_{t-1}$ maintains its content while its style changes to resemble the target style.
  • Figure 2: Desirable attributes of self-attention (SA) for style transfer. (a) A PCA visualization of the query features shows that they reflect similarities among patches well. That is, style transfer employing SA can preserve the original content, since similar content patches tend to receive similar attention scores from a corresponding style image patch. (b) We visualize a similarity map between the query in the blue box (an edge) of the content image and the keys of the style image. Because the feature representation of a large-scale DM encompasses both texture and semantics, a query exhibits higher similarity to keys that share a similar style, such as edges.
  • Figure 3: Overall framework. (Left) Illustration of the proposed style transfer method. We first invert the content image $z^c_0$ and the style image $z^s_0$ into the latent noise space as $z^c_T$ and $z^s_T$, respectively. Then, we initialize the noise of the stylized image $z_T^{cs}$ with initial latent AdaIN (Sec. \ref{sec_init_adain}), which combines the content and style noise, $z_T^c$ and $z_T^s$. While performing the reverse diffusion process from $z^{cs}_T$, we inject the content and style information through attention-based style injection (Sec. \ref{sec_attn_fusion}) and attention temperature scaling (Sec. \ref{sec_attn_rev_tem}). (Right) Detailed explanation of style injection and initial latent AdaIN. Style injection manipulates the self-attention (SA) layers during the reverse diffusion process. Specifically, at timestep $t$, we substitute the key ($K^{cs}_t$) and value ($V^{cs}_t$) in the SA of the stylized image with those of the style features, $K^s_t$ and $V^s_t$, from the same timestep $t$. At the same time, we preserve the content information by blending the query of the content $Q^c_t$ with the query of the stylized image $Q^{cs}_t$. Finally, we scale the magnitude of the attention map to compensate for the magnitude decrease caused by the feature substitution. Initial latent AdaIN produces the initial noise $z_T^{cs}$ by combining the style noise $z_T^s$ and the content noise $z_T^c$. Specifically, we modify the channel statistics of $z_T^c$ to resemble those of $z_T^s$ and regard the result as $z_T^{cs}$. We observe that this operation keeps the spatial layout of the content image while reflecting the color tones of the given style image well.
  • Figure 4: Visualization of the standard deviation of the attention map before softmax. (a) Attention-based style injection reduces the standard deviation of the self-attention map. "Original" denotes SA maps from the generation process without style injection. We use both style and content images for generation. (b) We compute the ratio between attention maps with and without style injection. For the standard deviation of the original image, we use the average of the content and style standard deviations.
  • Figure 5: Generated results with style injection only. (a) Images generated with attention-based style injection alone do not harmonize with the given style in terms of color tone. (b) To identify the effect of each SA feature on color tones, we additionally include the query in the style injection process. However, the color tones still resemble those of the content, suggesting that self-attention features have little effect on color tones.
  • ...and 14 more figures
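
The initial latent AdaIN step described in the Figure 3 caption (matching the channel statistics of $z_T^c$ to those of $z_T^s$) can be sketched as below. This is an illustrative NumPy version under stated assumptions: the function name `initial_latent_adain`, the `(channels, height, width)` layout, and the stabilizing constant `eps` are choices made for this example and are not taken from the paper.

```python
import numpy as np

def initial_latent_adain(z_c, z_s, eps=1e-5):
    """Match per-channel mean/std of the content latent to the style latent.

    z_c: inverted content latent, shape (channels, height, width)
    z_s: inverted style latent,   shape (channels, height_s, width_s)
    eps: small constant to avoid division by zero (an assumption here).
    """
    # Per-channel statistics over the spatial dimensions.
    mu_c = z_c.mean(axis=(1, 2), keepdims=True)
    sd_c = z_c.std(axis=(1, 2), keepdims=True)
    mu_s = z_s.mean(axis=(1, 2), keepdims=True)
    sd_s = z_s.std(axis=(1, 2), keepdims=True)
    # Normalize the content latent, then re-scale and shift with style statistics.
    return sd_s * (z_c - mu_c) / (sd_c + eps) + mu_s

rng = np.random.default_rng(0)
z_c = rng.normal(loc=0.0, scale=1.0, size=(4, 16, 16))
z_s = rng.normal(loc=0.5, scale=2.0, size=(4, 16, 16))
z_cs = initial_latent_adain(z_c, z_s)
print(z_cs.shape)
```

Only the per-channel mean and standard deviation change, so the relative spatial pattern of each content channel is kept, which matches the caption's observation that the layout is preserved while the color tone follows the style.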