U-StyDiT: Ultra-high Quality Artistic Style Transfer Using Diffusion Transformers
Zhanjie Zhang, Ao Ma, Ke Cao, Jing Wang, Shanyuan Liu, Yuhang Ma, Bo Cheng, Dawei Leng, Yuhui Yin
TL;DR
U-StyDiT introduces a transformer-based diffusion framework for ultra-high quality artistic style transfer by learning content–style disentanglement. It couples a Multi-view Style Modulator (MSM) to capture global and local style cues with a StyDiT Block that jointly learns content and style conditions via transformer-based diffusion, incorporating Canny-guided conditioning. The authors also propose Aes4M, a large-scale dataset of 4 million high-quality 1024×1024 style images with clear Canny maps, to enable stable training of diffusion models at high resolutions. Quantitative and qualitative experiments show that U-StyDiT achieves superior stylization fidelity, preserving content structure while delivering rich, artifact-free stylistic details, outperforming state-of-the-art style transfer methods. The work highlights practical potential for high-fidelity artistic rendering and sets a foundation for diffusion-transformer-based style transfer at scale, while noting limitations in open-world style generalization and computational demands.
Abstract
Ultra-high quality artistic style transfer refers to repainting an ultra-high quality content image using the style information learned from a style image. Existing artistic style transfer methods can be categorized into style reconstruction-based and content-style disentanglement-based approaches. Although these methods can generate artistic stylized images, the results still exhibit obvious artifacts and disharmonious patterns, which prevent them from reaching ultra-high quality. To address these issues, we propose a novel artistic image style transfer method, U-StyDiT, which is built on transformer-based diffusion (DiT) and learns content-style disentanglement to generate ultra-high quality artistic stylized images. Specifically, we first design a Multi-view Style Modulator (MSM) to learn style information from a style image from local and global perspectives, conditioning U-StyDiT to generate stylized images with the learned style information. Then, we introduce a StyDiT Block to learn content and style conditions simultaneously from a style image. Additionally, we propose an ultra-high quality artistic image dataset, Aes4M, comprising 10 categories, each containing 400,000 style images. This dataset effectively addresses the problem that existing style transfer methods cannot produce high-quality artistic stylized images because of the limited size and image quality of their training datasets. Finally, extensive qualitative and quantitative experiments validate that U-StyDiT creates higher-quality stylized images than state-of-the-art artistic style transfer methods. To our knowledge, our proposed method is the first to address the generation of ultra-high quality stylized images using transformer-based diffusion.
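The abstract does not say how the MSM obtains its local and global perspectives. Purely as an illustration of what "multi-view" could mean at the input level, the sketch below assumes one downsampled view of the whole style image plus a grid of full-resolution patches. The names `global_view`, `local_views`, and `multi_view`, the nearest-neighbour downsampling, and the non-overlapping grid are all hypothetical choices made for this example, not the authors' implementation.

```python
from typing import List

# Toy representation (an assumption for this sketch): a grayscale image
# stored as a list of rows of integer pixel values.
Image = List[List[int]]


def global_view(img: Image, size: int) -> Image:
    """Downsample the whole image to size x size by nearest-neighbour sampling.

    This keeps the overall layout (a 'global' cue) and discards fine detail.
    """
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def local_views(img: Image, grid: int) -> List[Image]:
    """Split the image into grid x grid non-overlapping patches.

    Each patch keeps the original resolution (a 'local' cue such as
    brushstroke texture), ordered row by row.
    """
    h, w = len(img), len(img[0])
    ph, pw = h // grid, w // grid
    return [[row[gc * pw:(gc + 1) * pw] for row in img[gr * ph:(gr + 1) * ph]]
            for gr in range(grid) for gc in range(grid)]


def multi_view(img: Image, size: int, grid: int) -> List[Image]:
    """Return the global view followed by all local views (hypothetical helper)."""
    return [global_view(img, size)] + local_views(img, grid)


# Toy 8x8 "style image" whose pixel value encodes its position.
toy = [[r * 8 + c for c in range(8)] for r in range(8)]
views = multi_view(toy, size=4, grid=2)
```

In a real pipeline each view would presumably be passed through an image encoder before conditioning the diffusion transformer; that step is omitted here because the abstract does not describe it.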
