Semantic Image Synthesis with Unconditional Generator

Jungwoo Chae, Hyunin Cho, Sooyeon Go, Kyungmook Choi, Youngjung Uh

TL;DR

This work addresses semantic image synthesis under limited supervision by repurposing a pretrained unconditional generator. It introduces a rearranger that guides feature-map rewriting to match proxy masks derived from online clustering, and a semantic mapper that translates user inputs into those proxy masks, enabling flexible conditioning such as sketches, edges, or scribbles. The approach is trained largely in a self-supervised manner with loss terms that align rearranged features to proxy masks and preserve source styles, while enabling exemplar-guided generation through proxy structures. Across multiple datasets and input conditions, the method achieves strong mask fidelity (high mIoU) and competitive image quality (low FID), with notable data efficiency and applicability to free-form editing, albeit with limitations in pixel-level detail due to proxy-mask resolution.

Abstract

Semantic image synthesis (SIS) aims to generate realistic images that match given semantic masks. Although recent methods achieve high-quality results and precise spatial control, they require a massive semantic segmentation dataset for training. Instead, we propose to employ a pre-trained unconditional generator and rearrange its feature maps according to proxy masks. The proxy masks are obtained by simple clustering of the feature maps of random samples from the generator. The feature rearranger learns to rearrange original feature maps to match the shape of proxy masks that come either from the original sample itself or from random samples. We then introduce a semantic mapper that produces proxy masks from various input conditions, including semantic masks. Our method is versatile across applications such as free-form spatial editing of real images, sketch-to-photo, and even scribble-to-photo. Experiments validate the advantages of our method on a range of datasets: human faces, animal faces, and buildings.
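
The proxy-mask step described in the abstract can be illustrated with a minimal sketch. It assumes plain k-means over the per-pixel feature vectors of a single feature map; the text here only says "simple clustering", so the algorithm, the function name `proxy_mask_from_features`, the cluster count, and the synthetic feature map are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def proxy_mask_from_features(feat, num_clusters=4, num_iters=10, seed=0):
    """Cluster a (C, H, W) feature map into an (H, W) integer label mask.

    Assumption: plain k-means stands in for the clustering step; the
    generator and its real feature maps are not modeled here.
    """
    c, h, w = feat.shape
    # Treat every spatial location as one C-dimensional sample.
    pixels = feat.reshape(c, h * w).T  # shape (H*W, C)
    rng = np.random.default_rng(seed)
    # Initialize centroids from randomly chosen pixels (fancy indexing copies).
    centroids = pixels[rng.choice(h * w, size=num_clusters, replace=False)]
    for _ in range(num_iters):
        # Squared Euclidean distance from every pixel to every centroid.
        dists = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(num_clusters):
            members = pixels[labels == k]
            if len(members) > 0:  # leave an empty cluster's centroid unchanged
                centroids[k] = members.mean(axis=0)
    return labels.reshape(h, w)

# Synthetic stand-in for a generator feature map: two regions with distinct features.
feat = np.zeros((8, 16, 16), dtype=np.float32)
feat[:, :, 8:] = 5.0
feat += 0.01 * np.random.default_rng(1).standard_normal(feat.shape).astype(np.float32)

mask = proxy_mask_from_features(feat, num_clusters=2)
```

In this toy case the left and right halves of `mask` each receive a single label. With a real generator, `feat` would be an intermediate feature map at the proxy-mask resolution, and the resulting labels carry no semantic names, which is why a separate semantic mapper is needed to translate user inputs into them.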

Paper Structure

This paper contains 32 sections, 6 equations, 21 figures, and 3 tables.

Figures (21)

  • Figure 1: Conditional GANs vs. Ours. (Left) Conditional GANs train a generator using expensive pairs of semantic masks and images. (Right) In contrast, our method does not require a large number of image-mask pairs during training and uses a pre-trained unconditional generator for semantic image synthesis. Furthermore, it accommodates various types of inputs, such as sketches or even simpler scribbles.
  • Figure 2: The rearranger is trained in a self-supervised manner with a self-reconstruction loss and a mask loss, reducing the reliance on a large image-mask dataset. $G_1$ denotes the earlier layers of the generator and $G_2$ denotes the later layers after the proxy mask resolution.
  • Figure 3: The semantic mapper transforms input masks that the generator cannot comprehend into proxy masks, enabling the generator to understand and synthesize the corresponding output. A one-shot segmentation network and a reconstruction loss are used to train the semantic mapper.
  • Figure 4: Comparison with LinearGAN given the same mask. LinearGAN is an indirect method that adjusts the mask through optimization, while our approach is designed to create images that better fit the given mask. As a result, our method produces images that align more accurately with the masks. Both LinearGAN and Ours use a single mask to train each model.
  • Figure 5: Comparison between our method and supervised models given the same mask (including the one-shot setting). Compared to existing SIS methods, Ours produces results that are closer to the ground truth and the most natural-looking output.
  • ...and 16 more figures