PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion

Guansong Lu, Yuanfan Guo, Jianhua Han, Minzhe Niu, Yihan Zeng, Songcen Xu, Zeyi Huang, Zhao Zhong, Wei Zhang, Hang Xu

TL;DR

PanGu-Draw tackles the resource-intensive nature of diffusion-based text-to-image synthesis by introducing Time-Decoupling Training, which splits a T2I diffusion model into a structure and a texture generator to boost data efficiency and halve training complexity, and Coop-Diffusion, a method to cooperatively fuse pre-trained diffusion models operating in different latent spaces and resolutions without additional data or retraining. The framework enables multi-control and multi-resolution image synthesis within a unified denoising process, and demonstrates strong English T2I performance (FID 7.99 on COCO) along with superior Chinese T2I metrics, while reducing data preparation by about 48% and total training resources by about 51%. A 5B multilingual PanGu-Draw model is released on the Ascend platform, with extensive ablations validating the effectiveness of the training strategy and the fusion mechanism. Collectively, the work offers a scalable path toward efficient, versatile diffusion models capable of integrating multiple controls and resolutions with limited retraining.

Abstract

Current large-scale diffusion models represent a giant leap forward in conditional image synthesis, capable of interpreting diverse cues like text, human poses, and edges. However, their reliance on substantial computational resources and extensive data collection remains a bottleneck. In addition, integrating existing diffusion models, each specialized for different controls and operating in its own latent space, is challenging because their image resolutions and latent space embedding structures are incompatible, which hinders their joint use. Addressing these constraints, we present "PanGu-Draw", a novel latent diffusion model designed for resource-efficient text-to-image synthesis that adeptly accommodates multiple control signals. First, we propose a resource-efficient Time-Decoupling Training Strategy, which splits the monolithic text-to-image model into structure and texture generators. Each generator is trained using a regimen that maximizes data utilization and computational efficiency, cutting data preparation by 48% and reducing training resources by 51%. Second, we introduce "Coop-Diffusion", an algorithm that enables the cooperative use of various pre-trained diffusion models with different latent spaces and predefined resolutions within a unified denoising process. This allows for multi-control image synthesis at arbitrary resolutions without the need for additional data or retraining. Empirical validations of PanGu-Draw show its exceptional prowess in text-to-image and multi-control image generation, suggesting a promising direction for future model training efficiencies and generation versatility. The largest 5B T2I PanGu-Draw model is released on the Ascend platform. Project page: https://pangu-draw.github.io
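
As a rough illustration of the time-decoupling idea in the abstract, the sketch below routes each denoising step to one of two generators according to the timestep. It is not the authors' implementation: the hand-off step `T_STRUCT`, the total step count `T`, and the names `pick_generator`, `sample`, and `step_fn` are all hypothetical, and the assumption that the structure generator handles the early, high-noise steps while the texture generator handles the late, low-noise steps is inferred from the generator names rather than stated in the text above.

```python
import numpy as np

T = 1000        # total number of diffusion steps (assumed value)
T_STRUCT = 500  # hypothetical hand-off step between the two generators


def pick_generator(t, t_struct=T_STRUCT):
    """Name the generator assumed to be responsible for timestep t."""
    return "structure" if t >= t_struct else "texture"


def sample(structure_eps, texture_eps, shape, step_fn, rng,
           num_steps=T, t_struct=T_STRUCT):
    """Run one reverse process, switching denoisers at t_struct.

    structure_eps / texture_eps: callables (x, t) -> predicted noise.
    step_fn: callable (x, eps, t) -> x at the previous timestep; any
             standard sampler update can be plugged in here.
    """
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in range(num_steps - 1, -1, -1):
        eps_model = structure_eps if t >= t_struct else texture_eps
        x = step_fn(x, eps_model(x, t), t)
    return x
```

Because each generator only ever sees its own range of timesteps in this sketch, the two could in principle be trained separately and on different data, which is the kind of saving the abstract attributes to the strategy.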

Paper Structure

This paper contains 17 sections, 6 equations, 16 figures, 7 tables, and 1 algorithm.

Figures (16)

  • Figure 1: Illustration of three multi-stage training strategies and a comparison of their resource efficiency in terms of data, training, and inference. Our time-decoupling training strategy is significantly more resource-efficient than representative methods in Cascaded Training [shonenkov2023deepfloyd, nichol2021glide] and Resolution Boost Training [sd, ye2023altdiffusion].
  • Figure 2: Visualization of our Coop-Diffusion algorithm for the cooperative integration of diverse pre-trained diffusion models. (a) Existing pre-trained diffusion models, each tailored for specific controls and operating within distinct latent spaces and image resolutions. (b) This sub-module bridges the gap arising from different latent spaces by transforming $\epsilon_t'$ in latent space B to the target latent space A as $\tilde{\epsilon}_t$. (c) This sub-module bridges the gap arising from different resolutions by performing upsampling on the predicted clean data $\hat{x}_{0,t}'$.
  • Figure 3: Results of fusing a low-resolution model and a high-resolution model with different upsampling methods. Upsampling from the intermediate $z_t$ results in severe artifacts, while our upsampling algorithm yields high-fidelity images.
  • Figure 4: Images generated with PanGu-Draw, our 5B multi-lingual text-to-image generation model. PanGu-Draw is able to generate multi-resolution high-fidelity images semantically aligned with the input prompts.
  • Figure 5: Generation results from fusing an image variation model with PanGu-Draw using the proposed Coop-Diffusion algorithm.
  • ...and 11 more figures
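
The resolution bridging described in the captions of Figure 2(c) and Figure 3 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's algorithm: it uses the standard DDPM relation $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ to recover the predicted clean data, upsamples that estimate rather than the noisy intermediate, and then re-noises it to the same step. The linear beta schedule, the nearest-neighbour upsampling, the re-noising step, and all function names are hypothetical choices made for illustration.

```python
import numpy as np


def alpha_bar(t, num_steps=1000):
    """Cumulative signal level at step t for a linear beta schedule (assumed)."""
    betas = np.linspace(1e-4, 0.02, num_steps)
    return np.cumprod(1.0 - betas)[t]


def predict_x0(x_t, eps, a_bar):
    """Invert x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps for x0."""
    return (x_t - np.sqrt(1.0 - a_bar) * eps) / np.sqrt(a_bar)


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling over the last two (spatial) axes."""
    return np.repeat(np.repeat(x, factor, axis=-2), factor, axis=-1)


def bridge_resolution(x_t_low, eps_low, t, factor, rng):
    """Carry a low-resolution state at step t over to a higher resolution.

    The predicted clean data is upsampled, not the noisy x_t itself, and
    fresh Gaussian noise is added so the result sits at noise level t again.
    """
    a_bar = alpha_bar(t)
    x0_low = predict_x0(x_t_low, eps_low, a_bar)
    x0_high = upsample_nearest(x0_low, factor)
    noise = rng.standard_normal(x0_high.shape)
    return np.sqrt(a_bar) * x0_high + np.sqrt(1.0 - a_bar) * noise
```

Interpolating the noisy state directly would alter the statistics of its noise component, so it would no longer match what a denoiser expects at step t. That is one plausible reading of the artifacts reported in Figure 3.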