CreativeSynth: Cross-Art-Attention for Artistic Image Synthesis with Multimodal Diffusion

Nisha Huang, Weiming Dong, Yuxin Zhang, Fan Tang, Ronghui Li, Chongyang Ma, Xiu Li, Tong-Yee Lee, Changsheng Xu

TL;DR

CreativeSynth addresses the challenge that painting semantics and composition are not well captured by style transfer or generic text-to-image models. It introduces Cross-Art-Attention within a diffusion-based multimodal framework to fuse semantic cues from images and text with a target artwork, preserving aesthetics while enabling editing tasks. Key contributions include decoupled cross-attention for multimodal fusion, AdaIN-based style alignment, semantic preservation through inversion-guided sampling, and extensive experiments across image variation, editing, fusion, and multimodal blending. The approach enables high-fidelity artistic synthesis without retraining, with broad applicability to artistic creation and style-consistent editing.

Abstract

Although remarkable progress has been made in image style transfer, style is just one of the components of artistic paintings. Directly transferring extracted style features to natural images often results in outputs with obvious synthetic traces. This is because key painting attributes including layout, perspective, shape, and semantics often cannot be conveyed and expressed through style transfer. Large-scale pretrained text-to-image generation models have demonstrated their capability to synthesize a vast amount of high-quality images. However, even with extensive textual descriptions, it is challenging to fully express the unique visual properties and details of paintings. Moreover, generic models often disrupt the overall artistic effect when modifying specific areas, making it more complicated to achieve a unified aesthetic in artworks. Our main novel idea is to integrate multimodal semantic information as a synthesis guide into artworks, rather than transferring style to the real world. We also aim to reduce the disruption to the harmony of artworks while simplifying the guidance conditions. Specifically, we propose an innovative multi-task unified framework called CreativeSynth, based on the diffusion model with the ability to coordinate multimodal inputs. CreativeSynth combines multimodal features with customized attention mechanisms to seamlessly integrate real-world semantic content into the art domain through Cross-Art-Attention for aesthetic maintenance and semantic fusion. We demonstrate the results of our method across a wide range of different art categories, proving that CreativeSynth bridges the gap between generative models and artistic expression. Code and results are available at https://github.com/haha-lisa/CreativeSynth.

Paper Structure

This paper contains 38 sections, 15 equations, 16 figures, and 3 tables.

Figures (16)

  • Figure 1: Given an art image, our CreativeSynth unified framework generates personalized digital art guided by either unimodal or multimodal prompts. This methodology not only yields artwork with high-fidelity realism but also effectively upholds the foundational concepts, composition, stylistic elements, and visual symbolism intrinsic to genuine artworks. CreativeSynth supports a wide array of intriguing applications, including (a) image variation, (b) image editing, (c) style transfer, (d) image fusion, and (e) multimodal blending.
  • Figure 2: The conceptual differences among the three image generation methods. (a) Classical Style Transfer [Huang:2017:AdaIn], which combines a style image (providing the desired artistic style) and a content image (providing the scene or structure). (b) Text-to-Image Synthesis [sdxl], which generates an image directly from random noise guided by textual descriptions, such as "a beautiful lady reading a magazine in the style of oil paintings", without requiring a reference style image. (c) CreativeSynth, which employs a "cross-art-attention" mechanism to seamlessly integrate semantic content with the desired style, producing outputs that are both semantically coherent and stylistically consistent.
  • Figure 3: The overall structure of CreativeSynth. Text features and image features are first acquired from separate text and image encoders, respectively. Then, the target and semantic images interact through AdaIN, which focuses the fusion on image art features. An innovative decoupled cross-attention mechanism fuses the attention between the multimodal inputs and is subsequently integrated into a U-Net architecture. The target image is transformed into a latent variable $z_T$ via DDIM inversion, and the final output is refined through a denoising network.
  • Figure 4: Qualitative comparisons of our proposed CreativeSynth with other extant methods. The results offer a visualization of image fusion between artistic and real images.
  • Figure 5: Visual comparison of our proposed CreativeSynth with state-of-the-art methods for text-guided editing of diverse types of art images.
  • ...and 11 more figures
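
The Figure 3 caption names two building blocks, AdaIN and decoupled cross-attention. The following is a minimal NumPy sketch of the textbook forms of both, not the authors' implementation. It assumes the standard AdaIN formula (normalize one feature map per channel, then rescale to the other's mean and standard deviation) and an additive decoupled cross-attention in which text and image keys/values are attended separately and summed. The function names, the `img_scale` weight, the choice of which image plays the "content" role, and all shapes are illustrative assumptions that do not come from the paper.

```python
import numpy as np


def adain(content, style, eps=1e-5):
    """Textbook AdaIN on feature maps shaped (C, H, W).

    Assumption: `content` stands in for the semantic-image features and
    `style` for the target-artwork features; the paper may wire them differently.
    """
    # Per-channel statistics over the spatial dimensions.
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    # Normalize the content features, then match the style statistics.
    return s_std * (content - c_mean) / (c_std + eps) + s_mean


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention for 2-D arrays (tokens, dim)."""
    scale = np.sqrt(q.shape[-1])
    return softmax(q @ k.T / scale) @ v


def decoupled_cross_attention(q, k_text, v_text, k_img, v_img, img_scale=1.0):
    """Attend to text and image tokens separately, then sum the results.

    `img_scale` is a hypothetical weight on the image branch.
    """
    text_out = attention(q, k_text, v_text)
    image_out = attention(q, k_img, v_img)
    return text_out + img_scale * image_out
```

Keeping the two attention branches separate means the image branch can be reweighted or switched off (`img_scale=0.0`) without touching the text branch, which is one plausible reason to decouple them.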