MM2Latent: Text-to-facial image generation and editing in GANs with multimodal assistance

Debin Meng; Christos Tzelepis; Ioannis Patras; Georgios Tzimiropoulos

TL;DR

MM2Latent tackles controllable multimodal face generation and editing by integrating StyleGAN2 with FaRL text encoding and dedicated autoencoders for spatial modalities (mask, sketch, 3DMM). A learnable MappingNet maps multimodal inputs into the StyleGAN $\mathcal{W}$ latent space, aided by a pseudo text embedding generation strategy to bridge training and inference. The framework enables hyperparameter-free inference and real-image editing, achieving state-of-the-art results in multimodal consistency and image quality while offering fast inference relative to diffusion-based methods. Its practical impact lies in providing flexible, high-quality facial synthesis and editing tools that leverage both semantic and spatial cues with efficient performance.

Abstract

Generating human portraits is a hot topic in the image generation area, e.g., mask-to-face generation and text-to-face generation. However, these unimodal generation methods lack controllability in image generation. Controllability can be enhanced by exploring the advantages and complementarities of various modalities. For instance, we can utilize the advantages of text in controlling diverse attributes and masks in controlling spatial locations. Current state-of-the-art methods in multimodal generation face limitations due to their reliance on extensive hyperparameters, manual operations during the inference stage, substantial computational demands during training and inference, or inability to edit real images. In this paper, we propose a practical framework - MM2Latent - for multimodal image generation and editing. We use StyleGAN2 as our image generator and FaRL for text encoding, and we train autoencoders for spatial modalities such as mask, sketch, and 3DMM. We propose a strategy that involves training a mapping network to map the multimodal input into the w latent space of StyleGAN. The proposed framework 1) eliminates hyperparameters and manual operations in the inference stage, 2) ensures fast inference speeds, and 3) enables the editing of real images. Extensive experiments demonstrate that our method exhibits superior performance in multimodal image generation, surpassing recent GAN- and diffusion-based methods. It also proves effective in multimodal image editing and is faster than GAN- and diffusion-based methods. We make the code publicly available at: https://github.com/Open-Debin/MM2Latent
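The data flow described in the abstract (a text embedding and a spatial embedding mapped into a single w code) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name MappingNetSketch, fusion by concatenation, the two-layer MLP, the leaky ReLU activation, and the toy dimensions are all assumptions chosen for illustration. In the actual framework the text embedding would come from FaRL, the spatial embedding from a trained autoencoder, and the resulting w code would be passed to a frozen StyleGAN2 generator.

```python
import math
import random


def make_linear(in_dim, out_dim, rng):
    """Create (weights, bias) for a dense layer with small random weights."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply a dense layer: y = W x + b."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def leaky_relu(x, slope=0.2):
    """Element-wise leaky ReLU (activation choice is an assumption)."""
    return [v if v > 0 else slope * v for v in x]


class MappingNetSketch:
    """Hypothetical stand-in for a mapping network: embeddings -> w code."""

    def __init__(self, text_dim, spatial_dim, w_dim, hidden_dim=32, seed=0):
        rng = random.Random(seed)
        self.text_dim = text_dim
        self.spatial_dim = spatial_dim
        self.fc1 = make_linear(text_dim + spatial_dim, hidden_dim, rng)
        self.fc2 = make_linear(hidden_dim, w_dim, rng)

    def __call__(self, text_emb, spatial_emb):
        if len(text_emb) != self.text_dim or len(spatial_emb) != self.spatial_dim:
            raise ValueError("embedding size mismatch")
        # Fusion by concatenation is an assumption, not the paper's stated design.
        fused = list(text_emb) + list(spatial_emb)
        return linear(leaky_relu(linear(fused, self.fc1)), self.fc2)


# Toy sizes for illustration only; random vectors stand in for real embeddings.
rng = random.Random(1)
text_emb = [rng.gauss(0, 1) for _ in range(16)]   # placeholder for a text embedding
mask_emb = [rng.gauss(0, 1) for _ in range(8)]    # placeholder for a mask embedding
mapper = MappingNetSketch(text_dim=16, spatial_dim=8, w_dim=12)
w = mapper(text_emb, mask_emb)
print(len(w))
```

Because only the mapping network produces the latent code, inference in such a design is a single forward pass with no per-sample optimization, which is consistent with the fast, hyperparameter-free inference the abstract claims.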

Paper Structure (26 sections, 9 equations, 6 figures, 4 tables)

Introduction
Related Work
Image Generation
Conditional Face Generation
Face Manipulation
Proposed Method
Main components of MM2Latent
The designing of multimodal fusion
The encoding of text
The encoding of mask
The encoding of sketch
The encoding of 3DMM
The image generator
Training losses
Training the whole framework
...and 11 more sections

Figures (6)

Figure 1: We propose MM2Latent, a versatile framework for multimodal image generation and editing using facial segmentation masks, sketches, and 3DMM parameters.
Figure 2: Overview of the proposed MM2Latent's training process. First, the mask autoencoder is trained followed by the training of the MappingNet while keeping the other modules fixed -- note that we show only the mask modality for brevity.
Figure 3: Multimodal image generation. Each generated image is accompanied by a textual description below it and a spatial mask, sketch, or 3DMM to its left.
Figure 4: Multimodal spatial editing. We focus on modifying the shape of the original image according to targeted spatial information, while preserving its inherent attributes.
Figure 5: Text-driven image editing. (a) Multimodal text-driven editing in our framework yields more faithful results, effectively fixing the facial shape and avoiding unwanted changes; (b) real-image editing with varying degrees of change.
...and 1 more figure
