HAM: A Training-Free Style Transfer Approach via Heterogeneous Attention Modulation for Diffusion Models

Yeqi He, Liang Li, Zhiwen Yang, Xichun Sheng, Zhidong Zhao, Chenggang Yan

Abstract

Diffusion models have demonstrated remarkable performance in image generation, particularly within the domain of style transfer. Prevailing style transfer approaches typically leverage pre-trained diffusion models' robust feature extraction capabilities alongside external modular control pathways to explicitly impose style guidance signals. However, these methods often fail to capture complex style references or to retain the identity of user-provided content images, and so struggle to balance style and content. We therefore propose a training-free style transfer approach via $\textbf{h}$eterogeneous $\textbf{a}$ttention $\textbf{m}$odulation ($\textbf{HAM}$) that protects identity information during image/text-guided style reference transfer, thereby addressing the style-content trade-off. Specifically, we first introduce style noise initialization to initialize the latent noise for diffusion. Then, during the diffusion process, we apply HAM to the different attention mechanisms through Global Attention Regulation (GAR) and Local Attention Transplantation (LAT), which better preserve the details of the content image while capturing complex style references. Our approach is validated through a series of qualitative and quantitative experiments, achieving state-of-the-art performance on multiple quantitative metrics.

Paper Structure (24 sections, 7 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Comparative results of style transfer methods: Content image (top-left), style reference (bottom-left), baseline method (top-right), and our HAM (bottom-right). Red boxes denote significant identity retention disparities.
  • Figure 2: The overall pipeline of our method. It consists of three main modules: global attention modulation, local attention transfer, and style injection noise initialization, which act on the self-attention, cross-attention, and noise initialization stages respectively. Through the joint modulation of the three modules, the final stylized image retains more content identity information while capturing and transferring complex style references.
  • Figure 3: Qualitative comparison with existing text-driven and image-driven SOTA methods. For fair evaluation, all methods use fixed random seeds: text-driven methods apply prompts directly, while image-driven methods generate style references via SD2.1 using identical prompts. Our HAM method better preserves content identity while maintaining style transfer semantics.
  • Figure 4: Qualitative results of our HAM method and the SOTA method under different style references for the same content image. Our HAM method shows significant advantages in both style transfer and identity preservation.
  • Figure 5: Qualitative ablation study of different modules in our method. The indexes are consistent with those in the quantitative experiments.
  • ...and 1 more figure