Yuan-TecSwin: A text conditioned Diffusion model with Swin-transformer blocks
Shaohua Wu, Tong Yu, Shenling Wang, Xudong Zhao
TL;DR
This work introduces Yuan-TecSwin, a text-conditioned diffusion model that replaces CNN blocks with Swin-transformer blocks, whose hierarchical, non-local processing better captures long-range semantic relations. It couples TeC-Swin's context-conditioned guidance, built on multi-layer M-CLIP embeddings, with an adaptive, staged time-step search that refines the denoising process. Trained on a massive Chinese multimodal corpus and fine-tuned on human-labeled artworks, it achieves a state-of-the-art FID of 1.37 on 64x64 ImageNet and strong MS-COCO performance with far fewer parameters than CNN-based rivals, while human evaluators find it difficult to tell generated images from human-painted ones. The results demonstrate high perceptual realism and effective text alignment, and they highlight the potential of Swin-based diffusion for efficient, high-quality text-to-image synthesis in multilingual contexts.
Abstract
Diffusion models have shown remarkable capacity in image synthesis, typically building on a U-shaped architecture with convolutional neural networks (CNNs) as basic blocks. The locality of the convolution operation may limit the model's ability to capture long-range semantic information. To address this issue, we propose Yuan-TecSwin, a text-conditioned diffusion model built on Swin-transformer blocks. These blocks replace the CNN blocks in the encoder and decoder to improve non-local modeling in feature extraction and image restoration. Text-image alignment is improved through a well-chosen text encoder, effective use of text embeddings, and careful design of how the text condition is incorporated. Adapting the time-step search to different diffusion stages further improves inference performance by 10%. Yuan-TecSwin achieves a state-of-the-art FID score of 1.37 on the ImageNet generation benchmark without any additional models at different denoising stages. In a side-by-side comparison, human interviewees find it difficult to tell the model-generated images from human-painted ones.
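To make the contrast with convolution concrete, the following is a minimal NumPy sketch of the generic window-based attention idea behind Swin-transformer blocks. It is not the Yuan-TecSwin implementation. The function names (`window_partition`, `window_self_attention`, `shifted`), the toy feature-map size, and the window size are all assumptions chosen for illustration. The sketch also omits learned projections, multiple heads, relative position bias, and the attention mask that a full shifted-window block would use.

```python
import numpy as np

def window_partition(x, window_size):
    """Split a feature map of shape (H, W, C) into non-overlapping windows.

    Returns an array of shape (num_windows, window_size * window_size, C).
    """
    H, W, C = x.shape
    assert H % window_size == 0 and W % window_size == 0
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two window-index axes together
    return x.reshape(-1, window_size * window_size, C)

def window_self_attention(windows):
    """Single-head self-attention inside each window (no learned weights).

    Every token attends to all tokens in its own window, so the receptive
    field of one layer is the whole window rather than a small kernel.
    """
    d = windows.shape[-1]
    scores = windows @ windows.transpose(0, 2, 1) / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ windows

def shifted(x, window_size):
    """Cyclically shift the map by half a window so the next layer's windows
    straddle the previous layer's window borders (illustrative, unmasked)."""
    s = window_size // 2
    return np.roll(x, shift=(-s, -s), axis=(0, 1))

# Toy 8x8 feature map with 4 channels and a hypothetical window size of 4.
feature_map = np.random.default_rng(0).normal(size=(8, 8, 4))
regular = window_self_attention(window_partition(feature_map, 4))
shifted_out = window_self_attention(window_partition(shifted(feature_map, 4), 4))
print(regular.shape, shifted_out.shape)
```

Alternating regular and shifted windows lets information propagate across window borders over successive layers, which is the mechanism usually credited for the non-local modeling ability of Swin-style blocks.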
