GANTASTIC: GAN-based Transfer of Interpretable Directions for Disentangled Image Editing in Text-to-Image Diffusion Models
Yusuf Dalva, Hidir Yesiltepe, Pinar Yanardag
TL;DR
GANTASTIC is a GAN-to-diffusion transfer framework that learns a diffusion-space latent direction $d$ from StyleGAN edits and applies it to a pre-trained diffusion model without fine-tuning, enabling disentangled image edits across multiple domains. The method jointly optimizes a semantic alignment loss and a latent alignment loss, $L = L_{sem} + L_{latent}$, using forward diffusion steps and CLIP-based signals so that edits are semantically meaningful and localized. Empirical results show qualitative and quantitative performance competitive with state-of-the-art diffusion-based and GAN-based editing approaches, with strong disentanglement and rapid zero-shot editing. The approach depends on the quality of the GAN directions and inherits biases from CLIP and diffusion priors, so deployment calls for care and bias mitigation.
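As a rough illustration of the summed objective $L = L_{sem} + L_{latent}$, the toy sketch below combines two terms over plain Python lists. Only the sum itself comes from the summary above; the specific forms of the two terms (a mean squared error on latents and a cosine distance on CLIP-style embeddings), along with every function and variable name, are assumptions made for illustration and should not be read as the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def latent_loss(z_src, z_edit, d):
    """ASSUMED form: mean squared error between the source latent shifted
    by the direction d and the latent of the GAN-edited image."""
    return sum((s + di - e) ** 2 for s, di, e in zip(z_src, d, z_edit)) / len(d)


def semantic_loss(emb_pred, emb_target):
    """ASSUMED form: cosine distance between two CLIP-style embeddings."""
    return 1.0 - cosine_similarity(emb_pred, emb_target)


def total_loss(z_src, z_edit, d, emb_pred, emb_target):
    """L = L_sem + L_latent, the only part stated in the summary."""
    return semantic_loss(emb_pred, emb_target) + latent_loss(z_src, z_edit, d)


# Hypothetical toy values: d exactly explains the latent shift and the
# embeddings coincide, so both terms should be (numerically) zero.
z_src, z_edit, d = [0.0, 1.0], [1.0, 3.0], [1.0, 2.0]
emb = [3.0, 4.0]
print(total_loss(z_src, z_edit, d, emb, emb))
```

In a real setting the latents would come from the forward diffusion process and the embeddings from a CLIP encoder, and $d$ would be optimized by gradient descent; the sketch only shows how the two terms add up.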
Abstract
The rapid advancement in image generation models has predominantly been driven by diffusion models, which have demonstrated unparalleled success in generating high-fidelity, diverse images from textual prompts. Despite their success, diffusion models encounter substantial challenges in image editing, particularly in executing disentangled edits: changes that target specific attributes of an image while leaving irrelevant parts untouched. In contrast, Generative Adversarial Networks (GANs) have been recognized for their success in disentangled edits through their interpretable latent spaces. We introduce GANTASTIC, a novel framework that takes existing directions from pre-trained GAN models, each representative of a specific, controllable attribute, and transfers these directions into diffusion-based models. This approach not only maintains the generative quality and diversity that diffusion models are known for but also significantly enhances their capability to perform precise, targeted image edits, thereby leveraging the best of both worlds.
