StarGAN v2: Diverse Image Synthesis for Multiple Domains

Yunjey Choi, Youngjung Uh, Jaejun Yoo, Jung-Woo Ha

TL;DR

StarGAN v2 addresses scalable, diverse image-to-image translation across multiple domains with domain-specific style codes. A mapping network produces these codes from latent vectors, a style encoder extracts them from reference images, and a single generator consumes them through AdaIN. The method supports both latent-guided and reference-guided synthesis, and is trained with a multi-task discriminator and a combination of adversarial, style reconstruction, diversity, and cycle losses. Results on CelebA-HQ and AFHQ show better image quality (lower FID) and greater diversity (higher LPIPS) than prior multi-domain and two-domain baselines, and ablations support the individual design choices. The authors also release the AFHQ dataset, code, and pretrained models to facilitate broader evaluation.
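
To make the AdaIN step mentioned above concrete, the following is a minimal sketch of generic adaptive instance normalization in plain Python. It is not the authors' implementation: the helper `style_to_affine`, the layer sizes, and the random weights are illustrative assumptions, and the released code may parameterize the scale and shift differently.

```python
import math
import random


def adain(channels, gammas, betas, eps=1e-5):
    """Generic AdaIN for one image.

    channels: list of C channels, each a flat list of activations.
    gammas, betas: per-channel scale and shift derived from a style code.
    """
    out = []
    for values, gamma, beta in zip(channels, gammas, betas):
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        std = math.sqrt(var + eps)
        # Normalize each channel, then re-style it with (gamma, beta).
        out.append([gamma * (v - mean) / std + beta for v in values])
    return out


def style_to_affine(style, weight, bias):
    """Hypothetical linear layer mapping a style code to (gammas, betas).

    weight has 2*C rows; the first C outputs are scales, the rest shifts.
    """
    projected = [
        sum(s * w for s, w in zip(style, row)) + b
        for row, b in zip(weight, bias)
    ]
    half = len(projected) // 2
    return projected[:half], projected[half:]


random.seed(0)
num_channels, num_pixels, style_dim = 3, 16, 4  # assumed toy sizes
features = [[random.gauss(0, 1) for _ in range(num_pixels)]
            for _ in range(num_channels)]
style = [random.gauss(0, 1) for _ in range(style_dim)]
weight = [[random.gauss(0, 0.1) for _ in range(style_dim)]
          for _ in range(2 * num_channels)]
bias = [0.0] * (2 * num_channels)

gammas, betas = style_to_affine(style, weight, bias)
styled = adain(features, gammas, betas)
```

After the call, each output channel has mean equal to its `beta` and standard deviation close to the magnitude of its `gamma`, which is how a style code can control feature statistics without changing the spatial layout of the input.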

Abstract

A good image-to-image translation model should learn a mapping between different visual domains while satisfying the following properties: 1) diversity of generated images and 2) scalability over multiple domains. Existing methods address either of the issues, having limited diversity or multiple models for all domains. We propose StarGAN v2, a single framework that tackles both and shows significantly improved results over the baselines. Experiments on CelebA-HQ and a new animal faces dataset (AFHQ) validate our superiority in terms of visual quality, diversity, and scalability. To better assess image-to-image translation models, we release AFHQ, high-quality animal faces with large inter- and intra-domain differences. The code, pretrained models, and dataset can be found at https://github.com/clovaai/stargan-v2.

Paper Structure

This paper contains 15 sections, 5 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Diverse image synthesis results on the CelebA-HQ dataset and the newly collected animal faces (AFHQ) dataset. The first column shows input images while the remaining columns are images synthesized by StarGAN v2.
  • Figure 2: Overview of StarGAN v2, consisting of four modules. (a) The generator translates an input image into an output image reflecting the domain-specific style code. (b) The mapping network transforms a latent code into style codes for multiple domains, one of which is randomly selected during training. (c) The style encoder extracts the style code of an image, allowing the generator to perform reference-guided image synthesis. (d) The discriminator distinguishes between real and fake images from multiple domains. Note that all modules except the generator contain multiple output branches, one of which is selected when training the corresponding domain.
  • Figure 3: Visual comparison of generated images using each configuration in the ablation table. Note that given a source image, the configurations (a) - (c) provide a single output, while (d) - (f) generate multiple output images.
  • Figure 4: Reference-guided image synthesis results on CelebA-HQ. The source and reference images in the first row and the first column are real images, while the rest are images generated by our proposed model, StarGAN v2. Our model learns to transform a source image reflecting the style of a given reference image. High-level semantics such as hairstyle, makeup, beard and age are followed from the reference images, while the pose and identity of the source images are preserved. Note that the images in each column share a single identity with different styles, and those in each row share a style with different identities.
  • Figure 5: Qualitative comparison of latent-guided image synthesis results on the CelebA-HQ and AFHQ datasets. Each method translates the source images (left-most column) to target domains using randomly sampled latent codes. (a) The top three rows correspond to the results of converting male to female and vice versa in the bottom three rows. (b) Every two rows from the top show the synthesized images in the following order: cat-to-dog, dog-to-wildlife, and wildlife-to-cat.
  • ...and 5 more figures