WikiStyle+: A Multimodal Approach to Content-Style Representation Disentanglement for Artistic Image Stylization
Ma Zhuoqi, Zhang Yixuan, You Zejun, Tian Long, Liu Xiyang
TL;DR
This work tackles the challenge of disentangling content and style for artistic image stylization by introducing WikiStyle+, a multimodal dataset with artwork, content descriptions, and style descriptions. It proposes a content-style disentangled diffusion framework that uses two Q-Formers and learnable cross-attention layers to inject separate style and content representations into a frozen Stable Diffusion model, enabling multimodal inputs and reducing content leakage. Through contrastive learning losses and a multimodal training regimen, the method achieves superior style fidelity, content alignment, and qualitative disentanglement, with strong quantitative gains over state-of-the-art baselines. The approach demonstrates that explicit multimodal supervision and controlled cross-attention can yield more faithful and nuanced stylizations that align with artistic characteristics while supporting diverse input modalities.
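The summary above mentions contrastive learning losses without giving their form. As a rough illustration only, the sketch below shows a symmetric InfoNCE loss, a common choice for aligning paired embeddings such as an image-derived style representation and the embedding of its style description. The function names, the temperature value, and the choice of InfoNCE itself are assumptions made here for illustration, not details taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_embs[i] and text_embs[i] form a positive pair; every other
    combination in the batch is treated as a negative. The temperature
    default is an illustrative assumption.
    """
    a = [l2_normalize(v) for v in image_embs]
    b = [l2_normalize(v) for v in text_embs]
    n = len(a)
    # Cosine-similarity logits, divided by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the image-to-text and text-to-image directions.
    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))
```

With correctly paired embeddings the loss is close to zero, and it grows when the pairs are shuffled, which is the signal that pulls matching image and text representations together.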
Abstract
Artistic image stylization aims to render content provided as text or an image in a target style, and decoupling content from style is the key to achieving satisfactory results. However, current methods for content and style disentanglement rely primarily on image supervision, which leads to two problems: 1) models support only one modality for style or content input; 2) disentanglement is incomplete, resulting in content leakage from the reference image. To address these issues, this paper proposes a multimodal approach to content-style disentanglement for artistic image stylization. We construct WikiStyle+, a dataset consisting of artworks with corresponding textual descriptions of style and content. Based on this multimodal dataset, we propose a diffusion model guided by disentangled representations. The disentangled representations are first learned by Q-Formers and then injected into a pre-trained diffusion model using learnable multi-step cross-attention layers. Experimental results show that our method achieves a thorough disentanglement of content and style in reference images under multimodal supervision, thereby enabling more refined stylization that aligns with the artistic characteristics of the reference style. The code of our method will be available upon acceptance.
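To make the injection step more concrete, here is a minimal sketch of one plausible reading: the hidden states of a diffusion block attend separately to content tokens and style tokens, and the two results are added back with independent scales. The names `attend` and `inject`, the scale parameters, and the omission of learned query/key/value projections are simplifying assumptions for illustration; the paper's actual layer design is not specified in the abstract.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Queries and keys must share a width; the output has the width of values.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append(
            [sum(w * v[c] for w, v in zip(weights, values))
             for c in range(len(values[0]))]
        )
    return out


def inject(hidden, content_tokens, style_tokens,
           content_scale=1.0, style_scale=1.0):
    """Add separate content and style attention outputs to the hidden states.

    Keeping the two branches apart lets either one be scaled or switched
    off without affecting the other (hypothetical interface).
    """
    content_out = attend(hidden, content_tokens, content_tokens)
    style_out = attend(hidden, style_tokens, style_tokens)
    return [
        [h + content_scale * c + style_scale * s
         for h, c, s in zip(h_row, c_row, s_row)]
        for h_row, c_row, s_row in zip(hidden, content_out, style_out)
    ]
```

Setting `style_scale=0.0` in this sketch removes the style contribution entirely while leaving the content branch untouched, which is the kind of independent control that separate representations are meant to allow.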
