CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion

Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, Jie Tang

TL;DR

CogView3 introduces relay diffusion to text-to-image generation, decomposing high-resolution output into a low-resolution base diffusion and a latent-space relaying super-resolution stage. Operating in latent space with a 3B UNet and a frozen T5-XXL text encoder, it leverages data re-captioning and prompt expansion to improve instruction-following, while employing progressive distillation to drastically cut inference time. Empirical results show CogView3 outperforms the open-source SDXL by 77% in human evaluations and halves inference time, with a distilled variant reaching comparable quality at 1/10 the time. The approach demonstrates substantial cost reductions for very high-resolution generation (2048×2048) and highlights practical improvements through data preprocessing and distillation within a relay-diffusion framework.

Abstract

Recent advancements in text-to-image generative systems have been largely driven by diffusion models. However, single-stage text-to-image diffusion models still face challenges in terms of computational efficiency and the refinement of image details. To tackle the issue, we propose CogView3, an innovative cascaded framework that enhances the performance of text-to-image diffusion. CogView3 is the first model implementing relay diffusion in the realm of text-to-image generation, executing the task by first creating low-resolution images and subsequently applying relay-based super-resolution. This methodology not only results in competitive text-to-image outputs but also greatly reduces both training and inference costs. Our experimental results demonstrate that CogView3 outperforms SDXL, the current state-of-the-art open-source text-to-image diffusion model, by 77.0% in human evaluations, all while requiring only about 1/2 of the inference time. The distilled variant of CogView3 achieves comparable performance while using only 1/10 of the inference time required by SDXL.
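
The two-stage flow described in the abstract (low-resolution generation followed by relay-based super-resolution) can be illustrated with a toy sketch. This is not the CogView3 implementation: every function name here (`expand_prompt`, `base_stage`, `relay_stage`, `generate`) is hypothetical, the "models" are placeholders operating on small grids of floats, and the noise-mixing rule is an assumption chosen only to show that the second stage starts from the upsampled first-stage result rather than from pure noise.

```python
import random
from typing import List

Grid = List[List[float]]


def expand_prompt(prompt: str) -> str:
    """Hypothetical stand-in for a text-expansion language model."""
    return prompt.strip() + ", highly detailed"


def base_stage(prompt: str, size: int, rng: random.Random) -> Grid:
    """Placeholder for the low-resolution base model: returns a size x size grid."""
    return [[rng.random() for _ in range(size)] for _ in range(size)]


def upsample_nearest(img: Grid, factor: int) -> Grid:
    """Nearest-neighbour upsampling of a 2D grid by an integer factor."""
    out: Grid = []
    for row in img:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def relay_stage(low_res: Grid, factor: int, start_noise: float,
                steps: int, rng: random.Random) -> Grid:
    """Toy relay step (assumed form): start from the upsampled low-res
    result mixed with Gaussian noise, then run a few refinement steps."""
    target = upsample_nearest(low_res, factor)
    x = [[(1 - start_noise) * v + start_noise * rng.gauss(0.0, 1.0) for v in row]
         for row in target]
    for _ in range(steps):
        # Placeholder denoiser: move halfway toward the upsampled grid.
        # A real model would add detail here instead of copying the target.
        x = [[xv + 0.5 * (tv - xv) for xv, tv in zip(xr, tr)]
             for xr, tr in zip(x, target)]
    return x


def generate(prompt: str, base_size: int = 8, factor: int = 2, seed: int = 0) -> Grid:
    """Run prompt expansion, the base stage, then the relay stage."""
    rng = random.Random(seed)
    expanded = expand_prompt(prompt)
    low = base_stage(expanded, base_size, rng)
    return relay_stage(low, factor, start_noise=0.5, steps=4, rng=rng)


image = generate("a red fox")
print(len(image), len(image[0]))
```

The point of the sketch is the control flow: because the second stage begins from an intermediate, partially noised state, it needs only a few refinement steps, which is the intuition behind the reduced inference cost claimed above.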

Paper Structure (34 sections, 16 equations, 12 figures, 3 tables, 1 algorithm)

Figures (12)

  • Figure 1: Showcases of CogView3 generations at resolutions of $2048\times 2048$ (top) and $1024\times 1024$ (bottom). All prompts are sampled from PartiPrompts [yu2022scaling].
  • Figure 2: An example of re-caption data collection from GPT-4V.
  • Figure 3: (left) The pipeline of CogView3. User prompts are rewritten by a text-expansion language model. The base stage model generates $512\times 512$ images, and the second stage subsequently performs relaying super-resolution. (right) Formulation of relaying super-resolution in the latent space.
  • Figure 4: Results of human evaluation on DrawBench generation. (left) Comparison of prompt alignment; (right) comparison of aesthetic quality. "(expanded)" indicates that the prompts used for generation are text-expanded.
  • Figure 5: Results of human evaluation on DrawBench generation for distilled models. (left) Comparison of prompt alignment; (right) comparison of aesthetic quality. "(expanded)" indicates that the prompts used for generation are text-expanded. We sample 8+2 steps for CogView3-distill and 4 steps for LCM-SDXL.
  • ...and 7 more figures