Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation

Jihyun Kim; Changjae Oh; Hoseok Do; Soohyun Kim; Kwanghoon Sohn

TL;DR

This work addresses multi-modal face image generation by bridging diffusion models and pre-trained GANs. It introduces a diffusion encoder $\mathcal{E}$, a Mapping Network $\mathcal{M}$, and an Attention-based Style Modulation Network $\mathcal{T}$ to produce GAN latents $w_t$ from diffusion features $h_t$, $f_t$, and $a_t$, enabling conditional 2D and 3D-aware face synthesis from text $c$ and visual inputs $x$. Through multi-denoising-step training, the method jointly optimizes $w^m_t$, $w^\gamma_t$, and $w^\beta_t$ so that $w_t = w^m_t \odot w^\gamma_t \oplus w^\beta_t$ yields high-fidelity, input-consistent images via a fixed GAN $\mathcal{G}$. Experiments on CelebAMask-HQ show the approach outperforms existing GAN- and diffusion-based baselines in both 2D and 3D settings, demonstrating strong semantic alignment with inputs and robust multi-modal control. The technique offers a practical path to controllable, photorealistic face synthesis and style transfer across modalities without extra data or loss terms.
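The latent composition $w_t = w^m_t \odot w^\gamma_t \oplus w^\beta_t$ can be illustrated with a minimal sketch. It assumes that $\odot$ is an element-wise product and $\oplus$ an element-wise addition, which matches the "scale and shift" wording in the Figure 2 caption. The function name `modulate_latent` and the 18 x 512 latent shape are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np


def modulate_latent(w_m, w_gamma, w_beta):
    """Scale-and-shift a mapped latent code: w_t = w_m * w_gamma + w_beta.

    Assumption: both operators act element-wise and all three codes
    share the same shape.
    """
    w_m, w_gamma, w_beta = (
        np.asarray(a, dtype=np.float64) for a in (w_m, w_gamma, w_beta)
    )
    if not (w_m.shape == w_gamma.shape == w_beta.shape):
        raise ValueError("w_m, w_gamma, and w_beta must share one shape")
    return w_m * w_gamma + w_beta


# Hypothetical latent shape (18 style layers x 512 channels); an assumption.
rng = np.random.default_rng(0)
w_m = rng.standard_normal((18, 512))   # stand-in for the mapping network output
w_gamma = np.ones((18, 512))           # identity scale
w_beta = np.zeros((18, 512))           # zero shift
w_t = modulate_latent(w_m, w_gamma, w_beta)
```

With an identity scale and a zero shift, the modulated code equals the mapped code, so in this reading the modulation network only has to learn deviations that encode the text and visual conditions.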

Abstract

We present a new multi-modal face image generation method that converts a text prompt and a visual input, such as a semantic mask or scribble map, into a photo-realistic face image. To do this, we combine the strengths of Generative Adversarial Networks (GANs) and diffusion models (DMs) by mapping the multi-modal features of the DM into the latent space of pre-trained GANs. We present a simple mapping network and a style modulation network that link the two models and convert meaningful representations in feature maps and attention maps into latent codes. With GAN inversion, the estimated latent codes can be used to generate 2D or 3D-aware facial images. We further present a multi-step training strategy that reflects textual and structural representations in the generated image. Our proposed network produces realistic 2D, multi-view, and stylized face images that align well with the inputs. We validate our method using pre-trained 2D and 3D GANs, and our results outperform existing methods. Our project page is available at https://github.com/1211sh/Diffusion-driven_GAN-Inversion/.

Paper Structure (16 sections, 6 equations, 10 figures, 2 tables)

Introduction
Related Work
GAN Inversion
Diffusion Model for Image Generation
Multi-Modal Face Image Generation
Method
Overview
Mapping Network
Attention-based Style Modulation Network
Loss Functions
Experiments
Experimental Setup
Results
Ablation Study
Limitations and Future Works
...and 1 more sections

Figures (10)

Figure 1: We present a method to map the diffusion features to the latent space of a pre-trained GAN, which enables diverse tasks in multi-modal face image generation and style transfer. Our method can be applied to 2D and 3D-aware face image generation.
Figure 2: Overview of our method. We use a diffusion-based encoder $\mathcal{E}$, the middle and decoder blocks of a denoising U-Net, that extracts the semantic features $\mathbf{h}_t$, intermediate features $\mathbf{f}_t$, and cross-attention maps $\mathbf{a}_t$ at denoising step $t$. We present the mapping network $\mathcal{M}$ (Sec. \ref{sec:MappingNet}) and the attention-based style modulation network (AbSMNet) $\mathcal{T}$ (Sec. \ref{sec:AbSMNet}) that are trained across $t$ (Sec. \ref{sec:Losses}). $\mathcal{M}$ converts $\mathbf{h}_t$ into the mapped latent code $\mathbf{w}^m_t$, and $\mathcal{T}$ uses $\mathbf{f}_t$ and $\mathbf{a}_t$ to control the facial attributes from the text prompt $c$ and visual input $\mathbf{x}$. The modulation codes $\mathbf{w}^\gamma_t$ and $\mathbf{w}^\beta_t$ are then used to scale and shift $\mathbf{w}^m_t$ to produce the final latent code, $\mathbf{w}_t$, that is fed to the pre-trained GAN $\mathcal{G}$. We obtain the generation output $I'_t$ from our model $\mathcal{Y}$, and we use the image $I^d_0$ from the U-Net after the entire denoising process for training $\mathcal{T}$ (Sec. \ref{sec:Losses}). Note that only the networks with the dashed line are trainable, while others are frozen.
Figure 3: Visualization of cross-attention maps and intermediate feature maps. (a) represents the semantic relation information between an input text and an input semantic mask in the spatial domain. The meaningful representations of inputs are shown across all denoising steps and $N$ different blocks. (b) represents $N$ different cross-attention maps, $\mathbf{A}_t$, at denoising steps $t=T$ and $t=0$. (c) shows an example of the refined intermediate feature map $\hat{\mathbf{F}}^1_T$ at the first block and $t=T$, which is emphasized according to the input multi-modal conditions. The red and yellow regions of the map indicate higher attention scores. As the denoising step $t$ approaches $T$, the text-relevant features appear more clearly, and as $t$ approaches 0, the features of the visual input are better preserved.
Figure 4: Style modulation network in $\mathcal{T}$. The refined intermediate feature maps $\hat{\mathbf{F}}_t$ and $\hat{\bar{\mathbf{F}}}_t$ are used to capture local and global semantic representations, respectively. They are fed into the scale and shift networks, respectively. The weighted summations of these outputs are used as input to the map2style network, which finally generates the scale and shift modulation latent codes, $\mathbf{w}^\gamma_t$ and $\mathbf{w}^\beta_t$.
Figure 5: Visual examples of the 2D face image generation using a text prompt and a semantic mask. For each semantic mask, we use three different text prompts (a)-(c), resulting in different output images (a)-(c).
...and 5 more figures
