GANTASTIC: GAN-based Transfer of Interpretable Directions for Disentangled Image Editing in Text-to-Image Diffusion Models

Yusuf Dalva, Hidir Yesiltepe, Pinar Yanardag

TL;DR

GANTASTIC introduces a GAN-to-diffusion transfer framework that learns a diffusion-space latent direction $d$ from StyleGAN edits and applies it to a pre-trained diffusion model without finetuning, enabling disentangled, principled image edits across multiple domains. The method jointly optimizes latent and semantic alignment losses, $L = L_{sem} + L_{latent}$, using forward diffusion steps and CLIP-based signals to ensure edits are semantically meaningful and localized. Empirical results show competitive qualitative and quantitative performance against state-of-the-art diffusion-based and GAN-based editing approaches, with strong disentanglement and rapid zero-shot editing. While the approach enhances editing precision and cross-domain applicability, it relies on the quality of GAN directions and faces biases in CLIP and diffusion priors, underscoring the need for careful deployment and bias mitigation.
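As a rough illustration of the joint objective $L = L_{sem} + L_{latent}$, the sketch below scores a candidate direction $d$ against paired (original, edited) samples. It is a toy under stated assumptions, not the paper's implementation: the real terms involve the diffusion model's denoising network and CLIP-based signals, whereas here `embed` is a hypothetical stand-in encoder, `z` and `z_edit` are hypothetical latent vectors, and the specific choices of mean squared error and cosine distance are assumptions made only for illustration.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def joint_loss(d, z, z_edit, embed):
    """Toy L = L_sem + L_latent for a candidate direction d.

    z, z_edit : (N, D) arrays, hypothetical latents of original / edited images.
    embed     : callable mapping (N, D) latents to semantic embeddings
                (a stand-in for an image encoder; an assumption of this sketch).
    """
    shifted = z + d
    # Latent alignment (assumed form): shifted latents should match the edited ones.
    l_latent = float(np.mean((shifted - z_edit) ** 2))
    # Semantic alignment (assumed form): the embedding shift caused by d should
    # point the same way as the embedding shift caused by the reference edit.
    target_shift = embed(z_edit) - embed(z)
    pred_shift = embed(shifted) - embed(z)
    l_sem = float(np.mean([1.0 - cosine(p, t)
                           for p, t in zip(pred_shift, target_shift)]))
    return l_sem + l_latent, l_sem, l_latent

# Synthetic demo: the "edit" is a fixed shift true_d in latent space.
rng = np.random.default_rng(0)
z = rng.normal(size=(16, 8))
true_d = rng.normal(size=8)
z_edit = z + true_d
W = rng.normal(size=(8, 4))           # hypothetical linear encoder weights
embed = lambda x: x @ W

loss_true, _, _ = joint_loss(true_d, z, z_edit, embed)
loss_wrong, _, _ = joint_loss(-true_d, z, z_edit, embed)
print(loss_true < loss_wrong)
```

In this synthetic setup the direction that reproduces the edit drives both terms to roughly zero, while the opposite direction is penalized by both, which is the behavior a joint objective of this shape is meant to encourage.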

Abstract

The rapid advancement in image generation models has predominantly been driven by diffusion models, which have demonstrated unparalleled success in generating high-fidelity, diverse images from textual prompts. Despite their success, diffusion models encounter substantial challenges in the domain of image editing, particularly in executing disentangled edits, that is, changes that target specific attributes of an image while leaving irrelevant parts untouched. In contrast, Generative Adversarial Networks (GANs) have been recognized for their success in disentangled edits through their interpretable latent spaces. We introduce GANTASTIC, a novel framework that takes existing directions from pre-trained GAN models (each representative of a specific, controllable attribute) and transfers these directions into diffusion-based models. This novel approach not only maintains the generative quality and diversity that diffusion models are known for but also significantly enhances their capability to perform precise, targeted image edits, thereby leveraging the best of both worlds.

Paper Structure (23 sections, 9 equations, 14 figures, 2 tables)

Figures (14)

  • Figure 1: GANTASTIC is a novel framework that transfers interpretable directions from pre-trained GAN models directly into diffusion-based models to enable disentangled and controllable image editing.
  • Figure 2: GANTASTIC framework. After generating a set of $N$ images using StyleGAN, denoted as $G(s)$, and their edited versions, denoted as $G(s + \Delta s)$, our framework learns a latent direction $d$ for the pre-trained diffusion model that reflects the edits introduced by $\Delta s$ (e.g., beard). To effectively learn such a latent direction, we utilize both the denoising network used by the diffusion model and the CLIP [radford2021learning] image encoder.
  • Figure 3: Capabilities of GANTASTIC. The proposed framework can successfully learn latent directions from a variety of domains including human faces and dog images. Additionally, GANTASTIC enables users to adjust the intensity of the editing effect through a scaling parameter. This functionality gives users the flexibility to either tone down or intensify the impact of a given editing direction. For instance, in the case of the Gender edit, users can lessen the effect for a more masculine appearance or enhance it for a more feminine look by applying a negative or positive scale, respectively.
  • Figure 4: Qualitative Results. GANTASTIC successfully transfers editing directions that modify the overall look, including changes in race or aging, as well as more detailed edits that target specific facial attributes, such as eyeglasses or a beard. GANTASTIC can also distinguish among various edits for the same feature, which underlines the versatility of our approach: users get an extensive selection of editing options for individual characteristics, like multiple smile designs (see Row 2) or styles of baldness (see Rows 1 and 2).
  • Figure 5: Qualitative Comparison with Diffusion-based Image Editing Methods. We compare our approach with Concept Sliders [gandikota2023sliders], SEGA [brack2023sega], and Cycle-Diffusion [wu2023latent]. The qualitative outcomes demonstrate that GANTASTIC outperforms the aforementioned methods in achieving disentangled image edits and in identifying detailed latent directions.
  • ...and 9 more figures
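
The intensity control described in the Figure 3 caption can be pictured as scaling a learned direction before applying it. The snippet below is a minimal sketch of that idea only: `apply_direction`, `base`, and `d` are hypothetical names, the vectors are random placeholders, and simple addition is an assumption; the source does not specify here which representation inside the diffusion model the direction is added to.

```python
import numpy as np

def apply_direction(base, d, scale=1.0):
    """Shift `base` along direction `d`; `scale` sets edit strength and sign."""
    return base + scale * np.asarray(d)

rng = np.random.default_rng(0)
base = rng.normal(size=8)   # hypothetical vector the direction acts on
d = rng.normal(size=8)      # hypothetical learned editing direction
unit_d = d / np.linalg.norm(d)

for scale in (-1.0, 0.0, 0.5, 1.0):
    edited = apply_direction(base, d, scale)
    # Signed displacement along d grows linearly with the scale:
    # negative scales move against the edit, positive scales move with it.
    print(scale, float((edited - base) @ unit_d))
```

A scale of zero leaves the input unchanged, and flipping the sign of the scale mirrors the displacement, which matches the caption's description of toning an edit down or intensifying it.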