Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation
Jihyun Kim, Changjae Oh, Hoseok Do, Soohyun Kim, Kwanghoon Sohn
TL;DR
This work addresses multi-modal face image generation by bridging diffusion models and pre-trained GANs. It introduces a diffusion encoder $\mathcal{E}$, a Mapping Network $\mathcal{M}$, and an Attention-based Style Modulation Network $\mathcal{T}$ to produce GAN latents $w_t$ from diffusion features $h_t$, $f_t$, and $a_t$, enabling conditional 2D and 3D-aware face synthesis from text $c$ and visual inputs $x$. Through multi-denoising-step training, the method jointly optimizes $w^m_t$, $w^\gamma_t$, and $w^\beta_t$ so that $w_t = w^m_t \odot w^\gamma_t \oplus w^\beta_t$ yields high-fidelity, input-consistent images via a fixed GAN $\mathcal{G}$. Experiments on CelebAMask-HQ show the approach outperforms existing GAN- and diffusion-based baselines in both 2D and 3D settings, demonstrating strong semantic alignment with inputs and robust multi-modal control. The technique offers a practical path to controllable, photorealistic face synthesis and style transfer across modalities without extra data or loss terms.
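As a minimal sketch of the latent composition $w_t = w^m_t \odot w^\gamma_t \oplus w^\beta_t$, the snippet below assumes that $\odot$ is an element-wise product and $\oplus$ an element-wise addition, i.e. a scale-and-shift modulation of the mapped code. The function name `modulate_latent`, the list-of-lists layout (one row per GAN layer), and the toy values are illustrative assumptions, not the authors' implementation.

```python
from typing import List

# Assumed layout: one latent vector per GAN layer, stored as a list of rows.
Latent = List[List[float]]


def modulate_latent(w_m: Latent, w_gamma: Latent, w_beta: Latent) -> Latent:
    """Return w_m * w_gamma + w_beta, computed element-wise.

    Assumption: the paper's ⊙ and ⊕ are read here as element-wise
    multiplication and addition, respectively.
    """
    if not (len(w_m) == len(w_gamma) == len(w_beta)):
        raise ValueError("all latents must have the same number of layers")
    w_t: Latent = []
    for row_m, row_g, row_b in zip(w_m, w_gamma, w_beta):
        if not (len(row_m) == len(row_g) == len(row_b)):
            raise ValueError("all latents must have the same dimension per layer")
        # Scale the mapped code by gamma, then shift it by beta.
        w_t.append([m * g + b for m, g, b in zip(row_m, row_g, row_b)])
    return w_t


# Toy example with 2 layers and 2 dimensions (real latents are much larger).
w_m = [[1.0, 2.0], [3.0, 4.0]]        # stands in for the mapped code w^m_t
w_gamma = [[2.0, 2.0], [0.5, 0.5]]    # stands in for the scale w^gamma_t
w_beta = [[0.5, 0.5], [1.0, 1.0]]     # stands in for the shift w^beta_t
w_t = modulate_latent(w_m, w_gamma, w_beta)
```

In the described pipeline, the resulting $w_t$ would then be passed to the fixed generator $\mathcal{G}$; that step is omitted here because it depends on the specific pre-trained GAN.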
Abstract
We present a new multi-modal face image generation method that converts a text prompt and a visual input, such as a semantic mask or scribble map, into a photo-realistic face image. To do this, we combine the strengths of generative adversarial networks (GANs) and diffusion models (DMs) by mapping the multi-modal features of the DM into the latent space of pre-trained GANs. We present a simple mapping network and a style modulation network that link the two models and convert meaningful representations in feature maps and attention maps into latent codes. With GAN inversion, the estimated latent codes can be used to generate 2D or 3D-aware facial images. We further present a multi-step training strategy that reflects textual and structural representations in the generated image. Our proposed network produces realistic 2D, multi-view, and stylized face images that align well with the inputs. We validate our method using pre-trained 2D and 3D GANs, and our results outperform existing methods. Our project page is available at https://github.com/1211sh/Diffusion-driven_GAN-Inversion/.
