WikiStyle+: A Multimodal Approach to Content-Style Representation Disentanglement for Artistic Image Stylization

Ma Zhuoqi, Zhang Yixuan, You Zejun, Tian Long, Liu Xiyang

TL;DR

This work tackles the challenge of disentangling content and style for artistic image stylization by introducing WikiStyle+, a multimodal dataset with artwork, content descriptions, and style descriptions. It proposes a content-style disentangled diffusion framework that uses two Q-Formers and learnable cross-attention layers to inject separate style and content representations into a frozen Stable Diffusion model, enabling multimodal inputs and reducing content leakage. Through contrastive learning losses and a multimodal training regimen, the method achieves superior style fidelity, content alignment, and qualitative disentanglement, with strong quantitative gains over state-of-the-art baselines. The approach demonstrates that explicit multimodal supervision and controlled cross-attention can yield more faithful and nuanced stylizations that align with artistic characteristics while supporting diverse input modalities.
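The summary above mentions contrastive learning losses without giving their form. As a rough illustration only, the sketch below implements a generic symmetric InfoNCE loss that pulls an image-derived embedding toward the embedding of its paired description and away from the other descriptions in the batch. The function names, the temperature value, and the use of InfoNCE itself are assumptions made for illustration; the paper's actual losses may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys.

    Hypothetical helper: a generic InfoNCE term, not the paper's exact loss.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over the batch of candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average the image-to-text and text-to-image InfoNCE terms."""
    return 0.5 * (
        info_nce(image_emb, text_emb, temperature)
        + info_nce(text_emb, image_emb, temperature)
    )


if __name__ == "__main__":
    # Toy 2-D embeddings: row i of each list is assumed to describe artwork i.
    style_from_image = [[1.0, 0.0], [0.0, 1.0]]
    style_from_text = [[0.9, 0.1], [0.1, 0.9]]
    print(symmetric_contrastive_loss(style_from_image, style_from_text))
```

In a setup like the one summarized here, one such term could be applied to style embeddings and style texts and another to content embeddings and content texts, so that each branch is supervised only by its own description.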

Abstract

Artistic image stylization aims to render content provided by text or image in a target style, where decoupling content and style is the key to achieving satisfactory results. However, current methods for content and style disentanglement rely primarily on image supervision, which leads to two problems: 1) models can support only one modality for style or content input; 2) incomplete disentanglement results in content leakage from the reference image. To address these issues, this paper proposes a multimodal approach to content-style disentanglement for artistic image stylization. We construct the WikiStyle+ dataset, which consists of artworks with corresponding textual descriptions of style and content. Based on this multimodal dataset, we propose a diffusion model guided by disentangled representations. The disentangled representations are first learned by Q-Formers and then injected into a pre-trained diffusion model using learnable multi-step cross-attention layers. Experimental results show that our method achieves a thorough disentanglement of content and style in reference images under multimodal supervision, thereby enabling more refined stylization that aligns with the artistic characteristics of the reference style. The code of our method will be available upon acceptance.
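To make the injection step concrete, here is a minimal sketch of how separate content and style token sets could each pass through their own cross-attention branch and be added to the latent features of a frozen backbone. It is a single-head, pure-Python toy under stated assumptions: the names `cross_attention` and `inject_content_and_style`, the additive combination, and the two scale factors are hypothetical, and the sketch does not model the multi-step aspect of the layers described in the abstract.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weight, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def cross_attention(latents, tokens, w_q, w_k, w_v):
    """Single-head cross-attention: each latent vector attends over the tokens."""
    queries = [matvec(w_q, x) for x in latents]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        dim = len(values[0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs


def inject_content_and_style(latents, content_tokens, style_tokens,
                             content_params, style_params,
                             content_scale=1.0, style_scale=1.0):
    """Add separately attended content and style features to the latents.

    Assumption: the two branches have their own (w_q, w_k, w_v) triples and
    are combined additively; the backbone that produced `latents` is frozen.
    """
    content_out = cross_attention(latents, content_tokens, *content_params)
    style_out = cross_attention(latents, style_tokens, *style_params)
    return [
        [x + content_scale * c + style_scale * s for x, c, s in zip(lat, con, sty)]
        for lat, con, sty in zip(latents, content_out, style_out)
    ]


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    params = (identity, identity, identity)
    latents = [[1.0, 0.0]]
    content_tokens = [[0.5, 0.5]]   # stand-in for content Q-Former output
    style_tokens = [[2.0, -1.0]]    # stand-in for style Q-Former output
    print(inject_content_and_style(latents, content_tokens, style_tokens,
                                   params, params))
```

Keeping the two branches separate means the style contribution can be scaled or switched off (for example `style_scale=0.0`) without touching the content pathway, which is one plausible way to limit content leakage from a style reference.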

Paper Structure

This paper contains 18 sections, 7 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Given a style reference image, our model can generate artistic images with refined stylization, effectively capturing the distinctive artistic characteristics of the intended style.
  • Figure 2: Examples from the WikiStyle+ dataset. Each item contains an artwork, a content text, and a style text.
  • Figure 3: Overview of our model, which contains three parts: 1) a pre-trained image encoder; 2) a Content and Style Disentangled Network (CSDN) connected to a pre-trained Stable Diffusion (SD) model; 3) learnable multi-step cross-attention layers (MCL) that inject the content and style features separately into the SD model.
  • Figure 4: Qualitative comparison with the state-of-the-art text-to-image stylization methods.
  • Figure 5: Qualitative results for content and style disentanglement.
  • ...and 2 more figures