AttenST: A Training-Free Attention-Driven Style Transfer Framework with Pre-Trained Diffusion Models

Bo Huang, Wenlun Xu, Qizhuo Han, Haodong Jing, Ying Li

TL;DR

AttenST tackles the high computational cost and content degradation of diffusion-based style transfer by introducing a training-free framework that leverages attention mechanisms. It combines four components—style-guided self-attention, style-preserving inversion, Content-Aware AdaIN, and Dual-Feature Cross-Attention—built on pre-trained diffusion models to balance content fidelity and stylistic expression. Empirical results on MS-COCO and WikiArt using SDXL show state-of-the-art FID, LPIPS, and ArtFID metrics, with ablations confirming the contribution of each component. The approach enables efficient, high-quality style transfer without fine-tuning, broadening the practical deployment of diffusion-based stylization.

Abstract

While diffusion models have achieved remarkable progress in style transfer tasks, existing methods typically rely on fine-tuning or optimizing pre-trained models during inference, leading to high computational costs and challenges in balancing content preservation with style integration. To address these limitations, we introduce AttenST, a training-free attention-driven style transfer framework. Specifically, we propose a style-guided self-attention mechanism that conditions self-attention on the reference style by retaining the query of the content image while substituting its key and value with those from the style image, enabling effective style feature integration. To mitigate style information loss during inversion, we introduce a style-preserving inversion strategy that refines inversion accuracy through multiple resampling steps. Additionally, we propose a content-aware adaptive instance normalization, which integrates content statistics into the normalization process to optimize style fusion while mitigating content degradation. Furthermore, we introduce a dual-feature cross-attention mechanism to fuse content and style features, ensuring a harmonious synthesis of structural fidelity and stylistic expression. Extensive experiments demonstrate that AttenST outperforms existing methods, achieving state-of-the-art performance on style transfer benchmarks.
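The key/value substitution described in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes plain scaled dot-product attention over small Python lists, and the function names (`attention`, `style_guided_self_attention`) and the toy token vectors are hypothetical. In the actual framework the queries, keys, and values would come from the self-attention layers of the pre-trained UNet.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention; q, k, v are lists of token vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


def style_guided_self_attention(q_content, k_style, v_style):
    """Keep the content query, but attend over the style image's key and value."""
    return attention(q_content, k_style, v_style)


# Toy data (hypothetical): 2 content tokens, 3 style tokens, dimension 2.
q_content = [[1.0, 0.0], [0.0, 1.0]]
k_style = [[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]]
v_style = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

stylized = style_guided_self_attention(q_content, k_style, v_style)
```

The output has one row per content token, so the spatial layout follows the content image, while every row is a convex combination of style values, which is where the style features enter.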

Paper Structure

This paper contains 24 sections, 10 equations, 13 figures, 5 tables, 1 algorithm.

Figures (13)

  • Figure 1: Pipeline of the AttenST. We start with the style-preserving inversion (Sec. 4.2) to invert content image $x^c_0$ and style image $x^s_0$, obtaining their respective latent noise representations, denoted as $X^c_T$ and $X^s_T$. During this process, the query of the content image $Q^{c}$ and the key-value pairs of the style image $(K^{s},V^{s})$ are extracted. Subsequently, the proposed CA-AdaIN mechanism (Sec. 4.3) is employed to refine the latent representation of the content, producing $x^{cs}_t$, which serves as the initial noise input for the UNet denoising process. Throughout denoising, the key and value derived from the self-attention of the style image are injected into the designated self-attention layers (Sec. 4.1), facilitating the integration of style features. Simultaneously, the features of the style and content images are processed through the DF-CA (Sec. 4.4) and incorporated into the corresponding blocks via cross-attention. This strategy constrains the generation process, ensuring effective style integration while preserving the original content, thereby achieving an optimal balance between style and content fidelity.
  • Figure 2: Qualitative results of style-guided self-attention mechanism application across different layers.
  • Figure 3: Style-preserving inversion process. We utilize the linear assumption to obtain $\hat{x}^1_{t}$, which provides a more accurate inversion direction compared to $x_{t-1}$. We then establish a refined inversion direction $-\epsilon_\theta(\hat{x}_t^1,t,c)$ by reversing the denoising trajectory from $\hat{x}^1_{t}$ to $x_{t-1}$, yielding a more precise estimated point $\hat{x}^2_{t}$.
  • Figure 4: Qualitative comparison with state-of-the-art methods.
  • Figure 5: Qualitative ablation study of our method.
  • ...and 8 more figures
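
The CA-AdaIN step mentioned in the Figure 1 caption refines the content latent before denoising. The text above only states that content statistics are integrated into the normalization, so the sketch below is one plausible reading rather than the paper's formula: it blends the content and style mean and standard deviation with a hypothetical weight `alpha` and applies the result to a single channel stored as a Python list. The names `content_aware_adain`, `alpha`, and `eps` are assumptions made for illustration.

```python
import statistics


def content_aware_adain(content, style, alpha=0.5, eps=1e-5):
    """AdaIN on one channel with blended target statistics (assumed form).

    alpha = 1.0 reduces to standard AdaIN (pure style statistics);
    alpha = 0.0 leaves the content channel essentially unchanged.
    """
    mu_c, mu_s = statistics.fmean(content), statistics.fmean(style)
    sd_c, sd_s = statistics.pstdev(content), statistics.pstdev(style)

    # Target statistics: interpolate between content and style.
    mu = alpha * mu_s + (1.0 - alpha) * mu_c
    sd = alpha * sd_s + (1.0 - alpha) * sd_c

    # Normalize the content channel, then re-scale and shift it.
    return [sd * (x - mu_c) / (sd_c + eps) + mu for x in content]


# Toy channels (hypothetical values).
content_channel = [0.0, 1.0, 2.0, 3.0]
style_channel = [10.0, 12.0, 14.0, 16.0]

pure_style = content_aware_adain(content_channel, style_channel, alpha=1.0)
pure_content = content_aware_adain(content_channel, style_channel, alpha=0.0)
blended = content_aware_adain(content_channel, style_channel, alpha=0.5)
```

Under this assumed form, intermediate values of `alpha` trade off how far the latent statistics move toward the style image, which mirrors the stated goal of fusing style while limiting content degradation.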