U-StyDiT: Ultra-high Quality Artistic Style Transfer Using Diffusion Transformers

Zhanjie Zhang, Ao Ma, Ke Cao, Jing Wang, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, Yuhui Yin

TL;DR

U-StyDiT introduces a transformer-based diffusion framework for ultra-high quality artistic style transfer by learning content–style disentanglement. It couples a Multi-view Style Modulator (MSM) to capture global and local style cues with a StyDiT Block that jointly learns content and style conditions via transformer-based diffusion, incorporating Canny-guided conditioning. The authors also propose Aes4M, a large-scale dataset of 4 million high-quality 1024×1024 style images with clear Canny maps, to enable stable training of diffusion models at high resolutions. Quantitative and qualitative experiments show that U-StyDiT achieves superior stylization fidelity, preserving content structure while delivering rich, artifact-free stylistic details, outperforming state-of-the-art style transfer methods. The work highlights practical potential for high-fidelity artistic rendering and sets a foundation for diffusion-transformer-based style transfer at scale, while noting limitations in open-world style generalization and computational demands.

Abstract

Ultra-high quality artistic style transfer refers to repainting an ultra-high quality content image using the style information learned from a style image. Existing artistic style transfer methods can be categorized into style reconstruction-based and content-style disentanglement-based approaches. Although these methods can generate artistic stylized images, they still exhibit obvious artifacts and disharmonious patterns, which hinder their ability to produce ultra-high quality results. To address these issues, we propose a novel artistic image style transfer method, U-StyDiT, which is built on transformer-based diffusion (DiT) and learns content-style disentanglement to generate ultra-high quality artistic stylized images. Specifically, we first design a Multi-view Style Modulator (MSM) to learn style information from a style image from local and global perspectives, conditioning U-StyDiT to generate stylized images with the learned style information. Then, we introduce a StyDiT Block to learn content and style conditions simultaneously from a style image. Additionally, we propose an ultra-high quality artistic image dataset, Aes4M, comprising 10 categories, each containing 400,000 style images (4 million in total). This dataset addresses a limitation of existing style transfer methods, whose output quality is constrained by the size of their training datasets and the quality of the images in them. Finally, extensive qualitative and quantitative experiments validate that U-StyDiT creates higher quality stylized images than state-of-the-art artistic style transfer methods. To our knowledge, our proposed method is the first to address the generation of ultra-high quality stylized images using transformer-based diffusion.

Paper Structure

This paper contains 14 sections, 12 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Artistic style transfer results by the proposed U-StyDiT. Given a content image and a style image, U-StyDiT produces ultra-high quality style transfer results that preserve the structure of the content image and the style information of the style image.
  • Figure 2: Stylized image examples generated by the proposed U-StyDiT, InstantX [flux-ipa], and CSGO [xing2024csgo]. Existing image stylization methods fail to create ultra-high quality stylized images, introducing obvious artifacts and disharmonious patterns.
  • Figure 3: Compared to WikiArt [wikiart], Aes4M has clearer Canny images.
  • Figure 4: The training pipeline of U-StyDiT. Given an artistic image from Aes4M with a resolution of $1024\times 1024$, we randomly crop multiple $512\times 512$ style patches $I_{ls}$. Then, we resize the artistic image to $512\times 512$, obtaining the style image $I_{gs}$. We use ChatGLM [glm2024chatglm] to extract the text $I_{t}$ from $I_{gs}$. When extracting Canny maps from $I_{gs}$, the high threshold is set to 200 and the low threshold is set to 100.
  • Figure 5: The details of Multi-view Style Modulator and StyDiT.
  • ...and 6 more figures
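
The data preparation described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the number of patches, and the use of 2×2 block averaging for the $1024\times 1024 \to 512\times 512$ resize are assumptions, since the caption does not specify them. Only the patch size (512) and the Canny thresholds (100 and 200) come from the caption. The Canny step assumes OpenCV is installed and is imported lazily so the cropping and resizing helpers depend on NumPy alone.

```python
import numpy as np


def random_style_patches(image, num_patches=4, patch_size=512, rng=None):
    """Randomly crop square local style patches (I_ls) from an H x W x C array.

    num_patches is a hypothetical default; the caption only says "multiple".
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    if height < patch_size or width < patch_size:
        raise ValueError("image is smaller than the requested patch size")
    patches = []
    for _ in range(num_patches):
        # Upper bound is exclusive, so +1 allows a crop flush with the border.
        top = int(rng.integers(0, height - patch_size + 1))
        left = int(rng.integers(0, width - patch_size + 1))
        patches.append(image[top:top + patch_size, left:left + patch_size])
    return patches


def downsample_by_two(image):
    """Halve both spatial dimensions of an H x W x C uint8 array by averaging
    2x2 blocks, e.g. 1024 x 1024 -> 512 x 512 for the global style image (I_gs).

    Block averaging is an assumed stand-in for whatever resize the authors use.
    """
    height, width = image.shape[:2]
    if height % 2 or width % 2:
        raise ValueError("height and width must both be even")
    blocks = image.reshape(height // 2, 2, width // 2, 2, -1).astype(np.float32)
    return blocks.mean(axis=(1, 3)).round().astype(np.uint8)


def canny_map(image_rgb, low=100, high=200):
    """Extract a Canny edge map with the thresholds stated in the caption."""
    import cv2  # OpenCV; imported here so the helpers above need only NumPy

    gray = cv2.cvtColor(image_rgb, cv2.COLOR_RGB2GRAY)
    return cv2.Canny(gray, low, high)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random pixels stand in for a 1024 x 1024 RGB image from Aes4M.
    artwork = rng.integers(0, 256, size=(1024, 1024, 3), dtype=np.uint8)
    local_patches = random_style_patches(artwork, rng=rng)
    global_style = downsample_by_two(artwork)
    print(len(local_patches), local_patches[0].shape, global_style.shape)
```

Passing `global_style` to `canny_map` would then yield the edge map used as the content condition; the text prompt $I_{t}$ is produced by a separate captioning model and is not sketched here.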