Tuning-Free Adaptive Style Incorporation for Structure-Consistent Text-Driven Style Transfer

Yanqi Ge; Jiaqi Liu; Qingnan Fan; Xi Jiang; Ye Huang; Shuai Qin; Hong Gu; Wen Li; Lixin Duan

TL;DR

This work tackles text-driven style transfer in diffusion-based T2I models, where prompt-level style injection often distorts content structure. It introduces Adaptive Style Incorporation (ASI), which combines Siamese Cross-Attention (SiCA), used to extract separate content and style features, with Adaptive Content-Style Blending (AdaBlending), a mask-guided, structure-aware fusion step. Neither component requires model tuning. The approach yields better structure preservation and stylization on both real and generated images, supported by qualitative and quantitative evaluations and extensive ablations. The method is positioned as practically useful for professional editing because it enables precise, locality-aware style transfer while maintaining semantic integrity, though the authors note limitations from inversion and increased computation from covariance-based masking.

Abstract

In this work, we target the task of text-driven style transfer in the context of text-to-image (T2I) diffusion models. The main challenge is preserving structure consistently while still achieving effective style transfer. Past approaches in this field directly concatenate the content and style prompts for prompt-level style injection, leading to unavoidable structure distortions. We propose a novel solution to the text-driven style transfer task, namely Adaptive Style Incorporation (ASI), to achieve fine-grained feature-level style incorporation. It consists of Siamese Cross-Attention (SiCA), which decouples the single-track cross-attention into a dual-track structure to obtain separate content and style features, and the Adaptive Content-Style Blending (AdaBlending) module, which couples the content and style information in a structure-consistent manner. Experimentally, our method exhibits much better performance in both structure preservation and stylized effects.
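The dual-track idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`cross_attention`, `siamese_cross_attention`, `blend`), the single-head attention, the random inputs, and the convex mask blend are all assumptions made for illustration. In particular, the random binary mask only stands in for the paper's attention-head level and spatial level mask extractors, which are not specified on this page.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, text_keys, text_values):
    # Standard scaled dot-product cross-attention (single head).
    scale = 1.0 / np.sqrt(queries.shape[-1])
    weights = softmax(queries @ text_keys.T * scale, axis=-1)
    return weights @ text_values

def siamese_cross_attention(queries, content_kv, style_kv):
    # Hypothetical dual-track attention: the same image queries attend
    # separately to the content prompt and to the style prompt.
    content_feat = cross_attention(queries, *content_kv)
    style_feat = cross_attention(queries, *style_kv)
    return content_feat, style_feat

def blend(content_feat, style_feat, mask):
    # Assumed blending rule: a convex combination in which the mask
    # selects where style features replace content features.
    return (1.0 - mask) * content_feat + mask * style_feat

rng = np.random.default_rng(0)
n_pixels, n_tokens, dim = 16, 5, 8
queries = rng.normal(size=(n_pixels, dim))
content_kv = (rng.normal(size=(n_tokens, dim)), rng.normal(size=(n_tokens, dim)))
style_kv = (rng.normal(size=(n_tokens, dim)), rng.normal(size=(n_tokens, dim)))

content_feat, style_feat = siamese_cross_attention(queries, content_kv, style_kv)
# Placeholder mask; the paper derives its masks from the features instead.
mask = (rng.random(size=(n_pixels, 1)) > 0.5).astype(float)
blended = blend(content_feat, style_feat, mask)
```

Keeping the two tracks separate until the blend is what lets a mask decide, per location, how much style is admitted. With prompt concatenation, content and style tokens are mixed inside a single attention pass, so there is no such point of control.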

Paper Structure (19 sections, 10 equations, 12 figures, 4 tables)

Introduction
Related Work
Diffusion Models
Style Transfer
Preliminaries
Proposed Method
Siamese Cross-Attention
Adaptive Content-Style Blending
Mask Extractor on Attention-Head Level
Mask Extractor on Spatial Level
Experiments
Implementation Details
Qualitative Comparison
Style Transfer
Visual Enhancement
...and 4 more sections

Figures (12)

Figure 1: We propose Adaptive Style Incorporation (ASI), a tuning-free diffusion-based style transfer method that enables versatile text-guided stylization of the source image. Our stylized results remain highly consistent with the structure and semantics of the source image, while their style changes significantly to follow the style prompt.
Figure 2: (a) Due to the text-image misalignment in T2I models, directly concatenating the content prompt with the style prompt (i.e., prompt-level coarse-grained style injection) introduces non-style information into the style transfer process, resulting in unavoidable structural and semantic drift in the stylized image, such as in the boy's hair and the pattern on his clothes. (b) We propose Adaptive Style Incorporation (ASI), which consists of the Siamese Cross-Attention (SiCA) and Adaptive Content-Style Blending (AdaBlending) modules. ASI explicitly parses style information that does not disrupt the structure of the content features and incorporates it into the content features to achieve feature-level fine-grained style incorporation.
Figure 3: Overall architecture of our proposed Adaptive Content-Style Blending (AdaBlending) module. AdaBlending achieves mask-guided fine-grained style incorporation through the proposed attention-head level and spatial level mask extractors, followed by the content-style blending operation in a structure-consistent manner.
Figure 4: Visualization of the top three leading components of average cross-attention features of the U-Net.
Figure 5: Sample results of our method for transferring both photography and artistic styles. Zoom in for the best view.
...and 7 more figures
