All are Worth Words: A ViT Backbone for Diffusion Models

Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, Jun Zhu

TL;DR

This work introduces U-ViT, a ViT-based diffusion backbone that treats the time, the condition, and the noisy image patches as tokens and uses long skip connections between shallow and deep layers to preserve low-level details. The authors run systematic ablations to settle the design choices, finding that feeding time in as a token, combining long skip branches by concatenation, and applying a 3×3 convolution after the projection yield strong performance. Across unconditional, class-conditional, and text-to-image generation, U-ViT matches or surpasses CNN-based U-Nets of similar size. In latent-diffusion setups it achieves record FID scores (ImageNet 256x256: 2.29; MS-COCO: 5.48) among methods that do not use large external datasets during training. The results suggest that long skip connections are crucial, while the down-sampling and up-sampling operators of CNN-based U-Nets are not always necessary. This positions ViTs as competitive backbones for diffusion models and offers guidance for future cross-modality diffusion research.

Abstract

Vision transformers (ViT) have shown promise in various vision tasks while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs including the time, condition and noisy image patches as tokens and employing long skip connections between shallow and deep layers. We evaluate U-ViT in unconditional and class-conditional image generation, as well as text-to-image generation tasks, where U-ViT is comparable if not superior to a CNN-based U-Net of a similar size. In particular, latent diffusion models with U-ViT achieve record-breaking FID scores of 2.29 in class-conditional image generation on ImageNet 256x256, and 5.48 in text-to-image generation on MS-COCO, among methods without accessing large external datasets during the training of generative models. Our results suggest that, for diffusion-based image modeling, the long skip connection is crucial while the down-sampling and up-sampling operators in CNN-based U-Net are not always necessary. We believe that U-ViT can provide insights for future research on backbones in diffusion models and benefit generative modeling on large scale cross-modality datasets.
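
The "all inputs as tokens" idea can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names (`patchify`, `linear`, `build_tokens`), the token order (time, then condition, then patches), the use of a single condition vector, and the omission of position embeddings are all assumptions made for illustration.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    flat.extend(image[y][x])  # append the C channel values
            patches.append(flat)
    return patches


def linear(vec, weight):
    """Project vec (length in_dim) with weight (in_dim x out_dim), no bias."""
    out_dim = len(weight[0])
    return [sum(v * weight[i][j] for i, v in enumerate(vec)) for j in range(out_dim)]


def build_tokens(image, time_embed, cond_embed, patch_size, weight):
    """Assemble one token sequence: [time, (condition), patch_1, ..., patch_N].

    The ordering and the single-vector condition are assumptions of this sketch.
    """
    tokens = [list(time_embed)]
    if cond_embed is not None:
        tokens.append(list(cond_embed))
    tokens.extend(linear(patch, weight) for patch in patchify(image, patch_size))
    return tokens


if __name__ == "__main__":
    # Toy 4x4 RGB image, patch size 2 -> 4 patches of 2*2*3 = 12 values each.
    image = [[[float(y * 4 + x)] * 3 for x in range(4)] for y in range(4)]
    dim = 5
    weight = [[0.1] * dim for _ in range(12)]  # hypothetical projection weights
    tokens = build_tokens(image, [0.0] * dim, [1.0] * dim, 2, weight)
    print(len(tokens), "tokens of width", len(tokens[0]))
```

Every input ends up as a row of the same sequence, so a standard transformer block can process time, condition, and image content without any dedicated conditioning pathway.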

Paper Structure (17 sections, 3 equations, 14 figures, 7 tables)

Figures (14)

  • Figure 1: The U-ViT architecture for diffusion models, which is characterized by treating all inputs including the time, condition and noisy image patches as tokens and employing (#Blocks-1)/2 long skip connections between shallow and deep layers.
  • Figure 2: Ablation of design choices. The one marked with * is the choice adopted by U-ViT, as illustrated in Figure 1. Since this ablation aims to determine implementation details, for efficiency we evaluate FID on 10K generated samples instead of 50K.
  • Figure 3: Effect of depth, width and patch size. The one marked with * corresponds to the setting of U-ViT-S/2 (see the table of U-ViT configurations).
  • Figure 4: Image generation results of U-ViT: selected samples on ImageNet 512×512 and ImageNet 256×256, and random samples on CIFAR10, CelebA 64×64, and ImageNet 64×64.
  • Figure 5: Ablation of the long skip connection on ImageNet 256×256 (without classifier-free guidance).
  • ...and 9 more figures
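
The count of (#Blocks-1)/2 long skip connections in the Figure 1 caption can be made concrete with a short sketch. It assumes an odd number of blocks and a U-Net-style last-in-first-out pairing, where the i-th shallow block feeds the mirrored deep block. The function names and the generic `combine` callable are hypothetical; in a real model, `combine` would stand in for the concatenation-based merge mentioned in the TL;DR.

```python
def long_skip_pairs(num_blocks):
    """Return (shallow, deep) block index pairs for an odd-depth stack.

    Assumes mirrored pairing: block i connects to block num_blocks - 1 - i,
    leaving one middle block without a skip.
    """
    if num_blocks < 1 or num_blocks % 2 == 0:
        raise ValueError("this sketch expects an odd, positive number of blocks")
    half = (num_blocks - 1) // 2
    return [(i, num_blocks - 1 - i) for i in range(half)]


def forward_with_long_skips(x, blocks, combine):
    """Run blocks in order, merging saved shallow outputs into deep blocks."""
    half = (len(blocks) - 1) // 2
    saved = []
    for block in blocks[:half]:           # shallow half: save each output
        x = block(x)
        saved.append(x)
    x = blocks[half](x)                   # middle block: no skip
    for block in blocks[half + 1:]:       # deep half: consume skips in reverse
        x = block(combine(x, saved.pop()))
    return x


if __name__ == "__main__":
    pairs = long_skip_pairs(13)           # 13 is an arbitrary odd depth
    print(len(pairs), "long skips:", pairs)
```

With 13 blocks the sketch produces six pairs, matching (13-1)/2, and the stack-based forward pass shows why the shallowest output is consumed by the deepest block.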