ArtWeaver: Advanced Dynamic Style Integration via Diffusion Model

Chengming Xu, Kai Hu, Qilin Wang, Donghao Luo, Jiangning Zhang, Xiaobin Hu, Yanwei Fu, Chengjie Wang

TL;DR

ArtWeaver is a novel framework that leverages pretrained Stable Diffusion (SD) to address challenges in stylized text-to-image generation such as misinterpreted styles and inconsistent semantics. It introduces two modules: the mixed style descriptor and the dynamic attention adapter.

Abstract

Stylized Text-to-Image Generation (STIG) aims to generate images from text prompts and style reference images. In this paper, we present ArtWeaver, a novel framework that leverages pretrained Stable Diffusion (SD) to address challenges such as misinterpreted styles and inconsistent semantics. Our approach introduces two innovative modules: the mixed style descriptor and the dynamic attention adapter. The mixed style descriptor enhances SD by combining content-aware and frequency-disentangled embeddings from CLIP with additional sources that capture global statistics and textual information, thus providing a richer blend of style-related and semantic-related knowledge. To achieve a better balance between adapter capacity and semantic control, the dynamic attention adapter is integrated into the diffusion UNet, dynamically calculating adaptation weights based on the style descriptors. Additionally, we introduce two objective functions to optimize the model alongside the denoising loss, further enhancing semantic and style consistency. Extensive experiments demonstrate the superiority of ArtWeaver over existing methods, producing images with diverse target styles while maintaining the semantic integrity of the text prompts.
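
The idea of an adapter whose weights are computed from the style descriptor, rather than stored as fixed parameters, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the low-rank factorization, the linear generators `g_a` and `g_b`, and all function names and dimensions below are assumptions chosen only to show the general pattern of style-conditioned adaptation weights added to a frozen attention projection.

```python
import numpy as np


def make_generators(d_style, d_model, rank, seed=0):
    """Hypothetical linear maps that turn a style embedding into adapter factors."""
    rng = np.random.default_rng(seed)
    g_a = rng.normal(scale=0.02, size=(d_style, d_model * rank))
    g_b = rng.normal(scale=0.02, size=(d_style, rank * d_model))
    return g_a, g_b


def dynamic_delta(style_emb, g_a, g_b, d_model, rank):
    """Compute a low-rank weight update from the style embedding (assumed design)."""
    a = (style_emb @ g_a).reshape(d_model, rank)   # (d_model, rank)
    b = (style_emb @ g_b).reshape(rank, d_model)   # (rank, d_model)
    return a @ b                                   # (d_model, d_model), rank <= `rank`


def adapted_projection(x, w_frozen, style_emb, g_a, g_b, rank, scale=1.0):
    """Project tokens with the frozen weight plus the style-dependent update."""
    d_model = w_frozen.shape[0]
    delta = dynamic_delta(style_emb, g_a, g_b, d_model, rank)
    return x @ (w_frozen + scale * delta)


# Usage: two different style embeddings yield two different projections
# of the same tokens, while the frozen weight is never modified.
rng = np.random.default_rng(1)
tokens = rng.normal(size=(4, 16))        # 4 tokens, d_model = 16
w_k = rng.normal(size=(16, 16))          # stand-in for a frozen key projection
g_a, g_b = make_generators(d_style=8, d_model=16, rank=2)
style_1 = rng.normal(size=8)
style_2 = rng.normal(size=8)
keys_1 = adapted_projection(tokens, w_k, style_1, g_a, g_b, rank=2)
keys_2 = adapted_projection(tokens, w_k, style_2, g_a, g_b, rank=2)
```

In this sketch the rank bounds how far a style can move the projection away from the pretrained one, which is one simple way to picture the trade-off between adapter capacity and semantic control that the abstract mentions.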

Paper Structure (27 sections, 8 equations, 18 figures, 2 tables)

This paper contains 27 sections, 8 equations, 18 figures, 2 tables.

Figures (18)

  • Figure 1: Overview of the proposed ArtWeaver. The Mixed Style Descriptor (MSD) (Sec. \ref{sec:style}) aggregates the style patterns in each reference image at both global and local levels. The style embeddings then enter the denoising procedure through the proposed Dynamic Attention Adaptation (DAA) (Sec. \ref{sec:attention}), which guides both attention mechanisms in the diffusion UNet to properly merge style and semantic information from different sources. During training, augmented input images serve as style reference images, and the objectives described in Sec. \ref{sec:loss} are used to optimize the model. During inference, the style reference images are assigned manually rather than derived by augmenting a specific image.
  • Figure 2: The structure of the Mixed Style Descriptor (MSD)
  • Figure 3: The structure of the Dynamic Attention Adapter (DAA)
  • Figure 4: One-shot qualitative comparison with SD1.5 as backbone. For comparison with SDXL as backbone, please refer to the supplementary material. Zoom in for more details.
  • Figure 5: Multi-shot qualitative comparison with SD1.5 as backbone. For detailed reference images and comparison with SDXL as backbone, please refer to the supplementary material. Zoom in for more details.
  • ...and 13 more figures