E$^{2}$GAN: Efficient Training of Efficient GANs for Image-to-Image Translation

Yifan Gong, Zheng Zhan, Qing Jin, Yanyu Li, Yerlan Idelbayev, Xian Liu, Andrey Zharkov, Kfir Aberman, Sergey Tulyakov, Yanzhi Wang, Jian Ren

TL;DR

Diffusion-based image editing is powerful but too resource-intensive for mobile deployment. E$^{2}$GAN distills diffusion-model knowledge into a transferable base GAN and adapts it to new concepts through LoRA-based fine-tuning of a small set of crucial layers, complemented by similarity-clustering data reduction that shrinks the fine-tuning set. The approach substantially reduces training cost and per-concept storage while delivering competitive image quality and real-time on-device performance, showing a practical path to bringing diffusion capabilities to edge devices. The work offers a framework for efficient knowledge transfer from large foundation models to lightweight, mobile-friendly architectures, with implications for privacy, latency, and accessibility.
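The data-reduction idea mentioned above can be illustrated with a small sketch. The summary here does not specify the clustering procedure, so the code below is an assumption: it uses greedy "leader" clustering on cosine similarity, and the names `cosine_similarity`, `select_representatives`, and the `threshold` parameter are hypothetical. The feature vectors stand in for image embeddings from any feature extractor.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_representatives(features, threshold=0.9):
    """Greedy leader clustering (an assumed stand-in, not the paper's procedure).

    A sample is kept only if its similarity to every already-kept sample is
    below `threshold`; otherwise it is treated as belonging to an existing
    cluster and dropped. Returns the indices of the kept samples.
    """
    kept = []
    for idx, feat in enumerate(features):
        if all(cosine_similarity(feat, features[k]) < threshold for k in kept):
            kept.append(idx)
    return kept


if __name__ == "__main__":
    # Toy 2-D "embeddings": two near-duplicate pairs.
    toy_features = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.02, 1.0]]
    print(select_representatives(toy_features, threshold=0.9))
```

Raising `threshold` keeps more samples (only very close duplicates are merged), while lowering it yields a smaller fine-tuning set at the risk of discarding useful variety.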

Abstract

One highly promising direction for enabling flexible real-time on-device image editing is utilizing data distillation by leveraging large-scale text-to-image diffusion models to generate paired datasets used for training generative adversarial networks (GANs). This approach notably alleviates the stringent requirements typically imposed by high-end commercial GPUs for performing image editing with diffusion models. However, unlike text-to-image diffusion models, each distilled GAN is specialized for a specific image editing task, necessitating costly training efforts to obtain models for various concepts. In this work, we introduce and address a novel research direction: can the process of distilling GANs from diffusion models be made significantly more efficient? To achieve this goal, we propose a series of innovative techniques. First, we construct a base GAN model with generalized features, adaptable to different concepts through fine-tuning, eliminating the need for training from scratch. Second, we identify crucial layers within the base GAN model and employ Low-Rank Adaptation (LoRA) with a simple yet effective rank search process, rather than fine-tuning the entire base model. Third, we investigate the minimal amount of data necessary for fine-tuning, further reducing the overall training time. Extensive experiments show that we can efficiently empower GANs with the ability to perform real-time high-quality image editing on mobile devices with remarkably reduced training and storage costs for each concept.
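To make the storage argument behind the second technique concrete, here is a minimal pure-Python sketch of a generic LoRA update, $W' = W + \frac{\alpha}{r} B A$, on a single dense weight matrix. It is not the authors' implementation: the function names, the scaling factor `alpha`, and the initialization are assumptions following the common LoRA convention, and the rank search described in the abstract is not covered.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def init_lora(d_out, d_in, rank, seed=0):
    """Create LoRA factors: B (d_out x rank) starts at zero, A (rank x d_in) is small noise."""
    rng = random.Random(seed)
    a_mat = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    b_mat = [[0.0] * rank for _ in range(d_out)]
    return b_mat, a_mat


def adapted_weight(w, b_mat, a_mat, alpha, rank):
    """Return W + (alpha / rank) * B @ A; the frozen base weight W is not modified."""
    scale = alpha / rank
    delta = matmul(b_mat, a_mat)
    return [[wi + scale * di for wi, di in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for full fine-tuning vs. the two LoRA factors."""
    return d_out * d_in, rank * (d_out + d_in)


if __name__ == "__main__":
    full, lora = lora_param_counts(d_out=256, d_in=256, rank=4)
    print(f"full fine-tuning: {full} params, LoRA: {lora} params")
```

Because `B` starts at zero, the adapted layer initially reproduces the base model exactly, and only the two small factors need to be trained and stored per concept.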

Paper Structure (29 sections, 1 equation, 15 figures, 14 tables, 1 algorithm)


Figures (15)

  • Figure 1: Overview of E$^2$GAN. Left: Training Comparison. Conventional GAN training, such as pix2pix [isola2017image] and pix2pix-zero-distilled, which distills Co-Mod-GAN [zhao2021large] using data from a diffusion model [parmar2023zero], requires all the weights to be trained from scratch, while our efficient training significantly reduces the training cost by fine-tuning only $1\%$ of the weights with only a portion of the training data. Right: Mobile Inference Comparison. Our efficient on-device model achieves real-time runtime ($30$ FPS on iPhone 14) and is faster than pix2pix and the diffusion model, while the pix2pix-zero-distilled model (Co-Mod-GAN) is not supported on device.
  • Figure 2: FID comparison of applying TBs in image generators trained on two datasets (Left: forest during autumn, Right: forest in the dusk). The vertical axis shows the position at which TBs are inserted. Pix2pix-zero-distilled uses pix2pix-zero to create datasets for training Co-Mod-GAN [ramesh2021zero].
  • Figure 3: Overview of E$^2$GAN model architecture. The generator is composed of down/up-sampling layers, 3 RBs, and 1 TB. The base generator is trained on multiple representative concepts. New concepts are achieved by fine-tuning LoRA parameters on crucial layers.
  • Figure 4: Crucial weights analysis via freezing partial weights in the base model. (a) Number of parameters for each part of the base model; (b) Average FID across $10$ different concepts on the Flickr-Scenery dataset when freezing partial weights of the base model. '-' indicates fine-tuning all the weights; (c) The generated images when freezing each part of the base model.
  • Figure 5: Qualitative comparisons on various tasks. The leftmost column shows two original images, and the remaining columns present the corresponding synthesized images in the target concept domain, with the target prompts shown in the bottom row. We provide images generated by various models.
  • ...and 10 more figures