Semantic Image Synthesis with Spatially-Adaptive Normalization

Taesung Park, Ming-Yu Liu, Ting-Chun Wang, Jun-Yan Zhu

TL;DR

The paper addresses semantic image synthesis from segmentation masks and identifies a weakness in conventional normalization-based generators, where semantic information is washed out. It introduces SPADE, a spatially-adaptive normalization layer that generates per-pixel modulation parameters from the input label map and applies them to normalized activations, thereby preserving semantic structure. A lightweight SPADE-based generator with ResNet blocks achieves superior fidelity and layout alignment across diverse datasets and supports multi-modal and style-guided outputs. Extensive ablations and comparisons demonstrate the effectiveness of spatial modulation over simple concatenation and highlight robustness across architectural choices, with code released for public use.

Abstract

We propose spatially-adaptive normalization, a simple but effective layer for synthesizing photorealistic images given an input semantic layout. Previous methods directly feed the semantic layout as input to the deep network, which is then processed through stacks of convolution, normalization, and nonlinearity layers. We show that this is suboptimal as the normalization layers tend to "wash away" semantic information. To address the issue, we propose using the input layout for modulating the activations in normalization layers through a spatially-adaptive, learned transformation. Experiments on several challenging datasets demonstrate the advantage of the proposed method over existing approaches, regarding both visual fidelity and alignment with input layouts. Finally, our model allows user control over both semantics and style. Code is available at https://github.com/NVlabs/SPADE .
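The "wash away" effect described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' code: the helper names (`instance_norm`, `conv1x1`, `uniform_one_hot`), the random weights, and the tiny 4x4 label maps are assumptions made for illustration, and a 1x1 convolution followed by a per-channel instance normalization stands in for a real generator layer. When the label map is uniform, every channel is constant over space after the convolution, so normalization maps it to (near) zero regardless of which label was given.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each channel of a (C, H, W) array over its spatial dimensions."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def conv1x1(x, weight, bias):
    """1x1 convolution: a per-pixel linear map from C_in to C_out channels."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

rng = np.random.default_rng(0)
num_classes, height, width = 3, 4, 4          # hypothetical sizes
weight = rng.normal(size=(8, num_classes))    # random stand-in for learned weights
bias = rng.normal(size=8)

def uniform_one_hot(label):
    """One-hot label map in which every pixel carries the same class."""
    m = np.zeros((num_classes, height, width))
    m[label] = 1.0
    return m

sky = uniform_one_hot(0)     # e.g. an all-"sky" layout
grass = uniform_one_hot(1)   # e.g. an all-"grass" layout

out_sky = instance_norm(conv1x1(sky, weight, bias))
out_grass = instance_norm(conv1x1(grass, weight, bias))

# Both outputs collapse to (near) zero, so the label can no longer be told apart.
washed_out = np.allclose(out_sky, out_grass)
print("label information lost after normalization:", washed_out)
```

The two inputs differ in every pixel, yet the normalized activations are indistinguishable, which is the failure mode that motivates injecting the layout through the normalization layer itself.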

Paper Structure

This paper contains 8 sections, 3 equations, 20 figures, 5 tables.

Figures (20)

  • Figure 1: Our model allows user control over both semantics and style when synthesizing an image. The semantics (e.g., the existence of a tree) are controlled via a label map (the top row), while the style is controlled via the reference style image (the leftmost column). Please visit https://github.com/NVlabs/SPADE for interactive image synthesis demos.
  • Figure 2: In the SPADE, the mask is first projected onto an embedding space and then convolved to produce the modulation parameters $\bm{\gamma}$ and $\bm{\beta}$. Unlike prior conditional normalization methods, $\bm{\gamma}$ and $\bm{\beta}$ are not vectors, but tensors with spatial dimensions. The normalized activation is multiplied by $\bm{\gamma}$ and then added to $\bm{\beta}$, element-wise.
  • Figure 3: Comparing results given uniform segmentation maps: while the SPADE generator produces plausible textures, the pix2pixHD generator (Wang et al., 2018) produces two identical outputs due to the loss of the semantic information after the normalization layer.
  • Figure 4: In the SPADE generator, each normalization layer uses the segmentation mask to modulate the layer activations. (left) Structure of one residual block with the SPADE. (right) The generator contains a series of the SPADE residual blocks with upsampling layers. Our architecture achieves better performance with a smaller number of parameters by removing the downsampling layers of leading image-to-image translation networks such as the pix2pixHD model (Wang et al., 2018).
  • Figure 5: Visual comparison of semantic image synthesis results on the COCO-Stuff dataset. Our method successfully synthesizes realistic details from semantic labels.
  • ...and 15 more figures
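
The modulation described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: the function names (`conv1x1`, `spade`), the parameter dictionary, the layer sizes, and the random weights are all hypothetical; 1x1 convolutions replace the layer's real convolutions to keep the example short; the mask is assumed to be already at the resolution of the activations; and normalization is done per channel over a single sample.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution on a (C_in, H, W) array (simplified stand-in convolution)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def spade(h, mask, p, eps=1e-5):
    """Modulate activations h (C, H, W) using a one-hot mask (K, H, W)."""
    # 1) Normalize without any learned affine parameters.
    mean = h.mean(axis=(1, 2), keepdims=True)
    std = np.sqrt(h.var(axis=(1, 2), keepdims=True) + eps)
    normalized = (h - mean) / std
    # 2) Project the mask onto an embedding space (convolution + ReLU).
    embedding = np.maximum(conv1x1(mask, p["w_embed"], p["b_embed"]), 0.0)
    # 3) Convolve the embedding into spatial tensors gamma and beta.
    gamma = conv1x1(embedding, p["w_gamma"], p["b_gamma"])
    beta = conv1x1(embedding, p["w_beta"], p["b_beta"])
    # 4) Element-wise scale and shift of the normalized activation.
    return gamma * normalized + beta, gamma, beta

rng = np.random.default_rng(0)
num_classes, channels, hidden, height, width = 2, 4, 16, 4, 6  # hypothetical sizes
params = {
    "w_embed": rng.normal(size=(hidden, num_classes)),
    "b_embed": rng.normal(size=hidden),
    "w_gamma": rng.normal(size=(channels, hidden)),
    "b_gamma": rng.normal(size=channels),
    "w_beta": rng.normal(size=(channels, hidden)),
    "b_beta": rng.normal(size=channels),
}

# Label map: class 0 on the left half, class 1 on the right half.
mask = np.zeros((num_classes, height, width))
mask[0, :, : width // 2] = 1.0
mask[1, :, width // 2 :] = 1.0

h = rng.normal(size=(channels, height, width))
out, gamma, beta = spade(h, mask, params)
print(out.shape, gamma.shape, beta.shape)
```

Because `gamma` and `beta` have the same spatial shape as the activations, pixels under different labels receive different scale and shift values, which is what distinguishes this scheme from conditional normalization with one vector per channel.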