ParaGAN: A Scalable Distributed Training Framework for Generative Adversarial Networks

Ziji Shi, Jialin Li, Yang You

TL;DR

ParaGAN is a scalable distributed GAN training framework that uses asynchronous training and an asymmetric optimization policy to accelerate GAN training, and it enables BigGAN to generate images at unprecedented resolution.

Abstract

Recent advances in Generative Artificial Intelligence have fueled numerous applications, particularly those involving Generative Adversarial Networks (GANs), which are essential for synthesizing realistic photos and videos. However, training GANs efficiently remains a critical challenge because the training is computationally intensive and numerically unstable. Existing methods often require days or even weeks of training, which imposes significant resource and time costs. In this work, we introduce ParaGAN, a scalable distributed GAN training framework that leverages asynchronous training and an asymmetric optimization policy to accelerate GAN training. ParaGAN employs a congestion-aware data pipeline and hardware-aware layout transformation to enhance accelerator utilization, improving throughput by over 30%. With ParaGAN, we reduce the training time of BigGAN from 15 days to 14 hours while achieving 91% scaling efficiency. Additionally, ParaGAN enables unprecedented high-resolution image generation using BigGAN.
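
To put the headline numbers in perspective, the short sketch below works through the arithmetic. The 15-day and 14-hour figures come from the abstract. The definition of scaling efficiency used here (measured speedup divided by the factor by which the worker count grew) is the common one and is assumed rather than taken from the paper. The worker ratio of 128 is purely hypothetical, since the abstract does not state the baseline accelerator count.

```python
def speedup(baseline_hours: float, accelerated_hours: float) -> float:
    """Wall-clock speedup: how many times faster the accelerated run is."""
    return baseline_hours / accelerated_hours


def scaling_efficiency(measured_speedup: float, worker_ratio: float) -> float:
    """Fraction of ideal linear speedup achieved when the worker count
    grows by worker_ratio (assumed definition, not quoted from the paper)."""
    return measured_speedup / worker_ratio


# Figures stated in the abstract.
baseline_hours = 15 * 24  # 15 days
paragan_hours = 14        # 14 hours
overall = speedup(baseline_hours, paragan_hours)
print(f"end-to-end speedup: {overall:.1f}x")  # prints "end-to-end speedup: 25.7x"

# Hypothetical illustration only: a 128x increase in workers at 91%
# efficiency would correspond to a measured speedup of 0.91 * 128.
hypothetical_ratio = 128
hypothetical_speedup = 0.91 * hypothetical_ratio
efficiency = scaling_efficiency(hypothetical_speedup, hypothetical_ratio)
print(f"efficiency: {efficiency:.2f}")
```

Note that the roughly 25.7x end-to-end reduction and the 91% scaling efficiency are separate measurements: the first compares total training time, while the second compares throughput against ideal linear scaling.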

Paper Structure

This paper contains 33 sections, 1 equation, 13 figures, 2 tables.

Figures (13)

  • Figure 1: ParaGAN scales to 1024 TPU accelerators at 91% scaling efficiency.
  • Figure 2: Typical GAN architecture.
  • Figure 3: Overview of ParaGAN.
  • Figure 4: Operator usage profile when training at scale.
  • Figure 5: Synchronous update and asynchronous update scheme in ParaGAN. $G_t$ and $D_t$ are the model weights of the generator and discriminator at iteration t.
  • ...and 8 more figures