Towards Open-World Text-Guided Face Image Generation and Manipulation

Weihao Xia, Yujiu Yang, Jing-Hao Xue, Baoyuan Wu

TL;DR

TediGAN tackles the problem of open-world, high-resolution text-guided face generation and manipulation. It introduces two strategies that reuse a pretrained StyleGAN as a prior: (i) an inversion-based pipeline with a visual-linguistic alignment module, and (ii) a pretrained-language-model-guided optimization (e.g., CLIP) for open-world inputs, including region-of-interest edits. The framework yields 1024×1024 outputs, supports multiple modalities (text, sketches, labels), and introduces the Multi-Modal CelebA-HQ dataset to facilitate research. Experiments show superior quality, diversity, and text fidelity compared with state-of-the-art methods, and demonstrate robust open-world and ROI capabilities. The work provides a scalable path toward flexible, multimodal face synthesis without retraining, with clear avenues for future efficiency and generalization improvements.

Abstract

Existing text-guided image synthesis methods can only produce results of limited quality at resolutions of at most $256^2$, and their textual instructions are constrained to a small corpus. In this work, we propose a unified framework for both face image generation and manipulation that produces diverse and high-quality images at an unprecedented resolution of 1024×1024 from multimodal inputs. More importantly, our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing. Specifically, we propose a new paradigm of text-guided image generation and manipulation based on the superior characteristics of a pretrained GAN model. The paradigm includes two novel strategies. The first strategy is to train a text encoder to obtain latent codes that align with the hierarchical semantics of the pretrained GAN model. The second strategy is to directly optimize the latent codes in the latent space of the pretrained GAN model with guidance from a pretrained language model. The latent codes can be randomly sampled from a prior distribution or inverted from a given image, which provides inherent support for both image generation and manipulation from multi-modal inputs, such as sketches or semantic labels, with textual guidance. To facilitate text-guided multi-modal synthesis, we propose Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation maps, sketches, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method. Code and data are available at https://github.com/weihaox/TediGAN.
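
The second strategy (optimizing a latent code against a text-matching score while the generator stays frozen) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `generate` is a small fixed function standing in for a pretrained GAN, `cosine` similarity to a hand-written `text_embedding` stands in for the score a pretrained language model would provide, and gradients are taken by finite differences rather than backpropagation. It is not the authors' implementation.

```python
import math
import random


def generate(w):
    """Toy stand-in for a frozen, pretrained generator (NOT StyleGAN)."""
    return [
        math.tanh(w[0] + 0.5 * w[1]),
        math.tanh(w[1] - 0.5 * w[2]),
        math.tanh(w[2] + 0.5 * w[0]),
    ]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def loss(w, text_embedding):
    """Lower is better: 1 - similarity between the output and the text target."""
    return 1.0 - cosine(generate(w), text_embedding)


def numerical_grad(f, w, eps=1e-5):
    """Central finite-difference gradient of f at w."""
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_minus = list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grad.append((f(w_plus) - f(w_minus)) / (2 * eps))
    return grad


def optimize_latent(w_init, text_embedding, steps=200, lr=0.1):
    """Gradient descent on the latent code only; the generator is never updated."""
    w = list(w_init)
    best_w, best_loss = list(w), loss(w, text_embedding)
    for _ in range(steps):
        grad = numerical_grad(lambda v: loss(v, text_embedding), w)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
        current = loss(w, text_embedding)
        if current < best_loss:  # keep the best latent code seen so far
            best_w, best_loss = list(w), current
    return best_w


# Generation case: start from a latent code sampled from a Gaussian prior.
# For manipulation, w0 would instead come from inverting a given image.
rng = random.Random(0)
w0 = [rng.gauss(0.0, 1.0) for _ in range(3)]
text_embedding = [1.0, -1.0, 0.5]  # hypothetical target in the toy output space
w_star = optimize_latent(w0, text_embedding)
print(loss(w0, text_embedding), loss(w_star, text_embedding))
```

Only the starting point differs between generation and manipulation in this sketch, which mirrors the abstract's remark that latent codes may be sampled from a prior or inverted from an image.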

Paper Structure

This paper contains 24 sections, 8 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Our TediGAN unifies text-guided image generation and manipulation into one framework, leading to continuous operations from generation to manipulation (a), inherent support of image synthesis from multi-modal inputs (b), and high-resolution synthesis (c).
  • Figure 2: Overview of our proposed method. We propose two strategies to use a pretrained GAN model. (a) demonstrates the first strategy. The key idea is to project multi-modal embeddings into the common $\mathcal{W}$ space of StyleGAN. Taking visual and linguistic embeddings as an example, with the learned inversion module, we can then learn the visual-linguistic similarity, where the visual embedding ${\rm\bf w}^v$ and linguistic embedding ${\rm\bf w}^l$ are expected to be close enough. The instance-level optimization is for identity preservation. The edited image can be generated from the StyleGAN generator. (b) illustrates the inference of text-guided image manipulation using the text encoder. Given a source image and a text guidance, we first obtain their embeddings ${\rm\bf w}^v$ and ${\rm\bf w}^l$ in $\mathcal{W}$ space through the corresponding encoders. We then perform style mixing for target layers and get the target latent code ${\rm\bf w}^t$. The final ${\rm\bf w}^{t*}$ is obtained through instance-level optimization. For image generation, we can directly obtain the results by feeding the latent codes from the text encoder into the generator. (c) illustrates text-guided image manipulation using a pretrained language model. In (d), we show that such optimization can be easily extended to support region-of-interest manipulation.
  • Figure 3: Diverse high-resolution results from multimodal inputs with textual guidance. Our method achieves diverse text-guided image generation and manipulation at an unprecedented resolution of 1024 $\times$ 1024.
  • Figure 4: Comparison of text-to-image generation on our Multi-Modal CelebA-HQ dataset. TediGAN-A and TediGAN-B denote the two proposed strategies: the trained text encoder and the pretrained language model, respectively.
  • Figure 5: Qualitative comparison of image manipulation using natural language descriptions.
  • ...and 11 more figures
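
The style mixing step in the Figure 2 caption (combining ${\rm\bf w}^v$ and ${\rm\bf w}^l$ on target layers to obtain ${\rm\bf w}^t$) can be sketched as a per-layer selection. This is a minimal illustration under stated assumptions: each latent code is represented as a list of per-layer vectors, 18 layers are assumed for a 1024 $\times$ 1024 StyleGAN, the vectors are shortened to 4 dimensions for readability, and the choice of target layers is hypothetical. The function name `style_mix` is introduced here and does not come from the paper or its code.

```python
NUM_LAYERS = 18  # assumed number of style layers for a 1024x1024 StyleGAN
LATENT_DIM = 4   # shortened for illustration; real codes are much wider


def style_mix(w_visual, w_text, target_layers):
    """Return a layer-wise code that takes the text-derived style on the
    target layers and keeps the image-derived style everywhere else."""
    if len(w_visual) != len(w_text):
        raise ValueError("both latent codes must have the same number of layers")
    targets = set(target_layers)
    return [
        list(w_text[i]) if i in targets else list(w_visual[i])
        for i in range(len(w_visual))
    ]


# Hypothetical embeddings: w_visual from an image encoder, w_text from a text encoder.
w_visual = [[0.0] * LATENT_DIM for _ in range(NUM_LAYERS)]
w_text = [[1.0] * LATENT_DIM for _ in range(NUM_LAYERS)]

# Hypothetical choice: let the text control the later layers only.
w_target = style_mix(w_visual, w_text, target_layers=range(8, NUM_LAYERS))
print([layer[0] for layer in w_target])
```

In the pipeline described by the caption, the mixed code would then be refined by instance-level optimization before being passed to the generator; that refinement is not shown here.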