Yuan-TecSwin: A text conditioned Diffusion model with Swin-transformer blocks

Shaohua Wu, Tong Yu, Shenling Wang, Xudong Zhao

TL;DR

This work introduces Yuan-TecSwin, a Swin-transformer-based, text-conditioned diffusion model that replaces CNN blocks with hierarchical, non-local Swin-transformer blocks to better capture long-range semantic relations. It couples TeC-Swin context-conditioned guidance, driven by multi-layer M-CLIP embeddings, with an adaptive, staged time-step search that refines the denoising process. Trained on a large Chinese multimodal corpus and fine-tuned on human-labeled artworks, it reports an FID of 1.37 on 64x64 ImageNet (state of the art) and strong MS-COCO performance with far fewer parameters than CNN-based rivals, and human evaluators struggle to distinguish its generated images from human-painted ones. The results indicate high perceptual realism and effective text alignment, highlighting the potential of Swin-based diffusion for efficient, high-quality text-to-image synthesis in multilingual contexts.

Abstract

Diffusion models have shown remarkable capacity in image synthesis, typically relying on a U-shaped architecture with convolutional neural networks (CNNs) as basic blocks. The locality of the convolution operation may limit the model's ability to understand long-range semantic information. To address this issue, in this work we propose Yuan-TecSwin, a text-conditioned diffusion model built on Swin-transformer blocks. The Swin-transformer blocks take the place of CNN blocks in the encoder and decoder to improve non-local modeling in feature extraction and image restoration. Text-image alignment is improved with a well-chosen text encoder, effective use of the text embedding, and a careful design for incorporating the text condition. Using an adapted time step to search in different diffusion stages, inference performance is further improved by 10%. Yuan-TecSwin achieves a state-of-the-art FID score of 1.37 on the ImageNet generation benchmark, without any additional models at different denoising stages. In a side-by-side comparison, we find it difficult for human interviewees to tell the model-generated images from human-painted ones.
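The non-local modeling mentioned above rests on the window scheme of the original Swin transformer: attention is computed inside fixed windows, and every other block cyclically shifts the feature map so that the windows straddle the previous boundaries. The following is a minimal sketch of that partitioning step only, not the authors' implementation. The function names, the 8x8 grid, and the window size of 4 are illustrative assumptions, and the attention computation and the masking of wrapped regions are omitted.

```python
def cyclic_shift(grid, shift):
    """Roll a 2D grid up and left by `shift` cells, wrapping around the edges."""
    height, width = len(grid), len(grid[0])
    return [
        [grid[(r + shift) % height][(c + shift) % width] for c in range(width)]
        for r in range(height)
    ]


def window_partition(grid, window_size):
    """Split a 2D grid into non-overlapping window_size x window_size windows."""
    height, width = len(grid), len(grid[0])
    if height % window_size or width % window_size:
        raise ValueError("grid dimensions must be multiples of window_size")
    windows = []
    for top in range(0, height, window_size):
        for left in range(0, width, window_size):
            windows.append(
                [row[left:left + window_size] for row in grid[top:top + window_size]]
            )
    return windows


# Hypothetical 8x8 feature map whose cells hold their own flat index.
feature_map = [[r * 8 + c for c in range(8)] for r in range(8)]

# Regular layout: four 4x4 windows aligned with the grid.
regular = window_partition(feature_map, window_size=4)

# Shifted layout: shift by half a window (2 cells) before partitioning.
shifted = window_partition(cyclic_shift(feature_map, shift=2), window_size=4)
```

In the shifted layout, a single window groups cells that belonged to different windows in the regular layout, so alternating the two layouts lets information cross window boundaries without computing attention over the whole image.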

Paper Structure

This paper contains 25 sections, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Sample images generated by Yuan-TecSwin. (a) A Chinese-style painting of a tiger crouching in the bushes (画一张国画,蹲在草丛中的老虎); (b) An Impressionist oil painting depicting the sunrise over the sea (画一幅印象派油画,描绘了海上日出的场景); (c) A wedding cake with three layers, decorated with roses (一个三层的婚礼蛋糕,蛋糕上装点着玫瑰花); (d) A Chinese painting of a basket of fruit (一幅中国画,关于一篮子水果); (e) An oil painting of a woman with blonde curly hair (画一幅油画,一个女人,长着金色的卷发); (f) A photo of a living room with green walls. There are two couches in the living room with a glass tea table in front of them. There are white and green cushions on the couches and green plants on the tea table. (这是一张客厅的照片,客厅的墙是绿色的,摆着两个沙发,沙发前面有一个玻璃茶几,沙发上放着白色和绿色的靠枕,茶几上放着绿色植物); (g) An ink wash painting depicting mountains among the clouds (画一幅水墨画,描绘了云雾中的山峦); (h) Interior view of a greenhouse corridor with many flowers planted and sunlight streaming in (一座温室的走廊内景,温室里种了很多花,有阳光照进温室中来); (i) An oil painting of a golden kitty with blue eyes (画一幅油画,一只金色的猫咪,猫咪的眼睛是蓝色的).
  • Figure 2: Images in different styles generated with the same caption (Yuan-TecSwin fine-tuned on the artworks dataset mentioned in Section III.D).
  • Figure 3: The overall architecture of the Swin-transformer-style base model.
  • Figure 4: Impact of (a) cond-scale and (b) timestep on FID.