Semantic Image Synthesis with Spatially-Adaptive Normalization
Taesung Park, Ming-Yu Liu, Ting-Chun Wang, Jun-Yan Zhu
TL;DR
The paper addresses semantic image synthesis from segmentation masks and identifies a weakness in generators built on conventional normalization layers: those layers tend to wash out the semantic information carried by the input layout. It introduces SPADE, a spatially-adaptive normalization layer that computes per-pixel scale and bias parameters from the input label map and applies them to the normalized activations, so that the semantic structure is preserved. A lightweight SPADE-based generator with ResNet blocks achieves better visual fidelity and layout alignment than existing approaches on several datasets, and it supports multi-modal and style-guided outputs. Ablations and comparisons show that spatial modulation is more effective than simply concatenating the label map and that the gains hold across architectural choices. Code is publicly released.
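The modulation described above can be written compactly. This is a sketch in notation assumed here rather than quoted from this page: h is an activation at batch index n, channel c, and position (y, x); m is the segmentation mask; γ and β are the learned, spatially varying scale and bias; μ_c and σ_c are the channel-wise mean and standard deviation over the batch and spatial dimensions (N, H, W).

```latex
\[
\gamma_{c,y,x}(\mathbf{m}) \, \frac{h_{n,c,y,x} - \mu_c}{\sigma_c} + \beta_{c,y,x}(\mathbf{m}),
\qquad
\mu_c = \frac{1}{NHW} \sum_{n,y,x} h_{n,c,y,x},
\quad
\sigma_c = \sqrt{\frac{1}{NHW} \sum_{n,y,x} h_{n,c,y,x}^{2} - \mu_c^{2}}
\]
```

Unlike the scalar-per-channel scale and bias of ordinary batch normalization, γ and β here depend on the position (y, x) through the mask, which is what keeps the layout visible after normalization.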
Abstract
We propose spatially-adaptive normalization, a simple but effective layer for synthesizing photorealistic images given an input semantic layout. Previous methods directly feed the semantic layout as input to the deep network, which is then processed through stacks of convolution, normalization, and nonlinearity layers. We show that this is suboptimal as the normalization layers tend to "wash away" semantic information. To address the issue, we propose using the input layout for modulating the activations in normalization layers through a spatially-adaptive, learned transformation. Experiments on several challenging datasets demonstrate the advantage of the proposed method over existing approaches, regarding both visual fidelity and alignment with input layouts. Finally, our model allows user control over both semantics and style. Code is available at https://github.com/NVlabs/SPADE.
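To make the "wash away" argument and the proposed fix concrete, here is a minimal NumPy sketch. It is not the released implementation: the function names (`normalize`, `one_hot`, `spade`) and the weight matrices `w_gamma` and `w_beta` are hypothetical, and the learned transformation is simplified to a 1x1 convolution on the one-hot label map, so each class simply looks up its own scale and bias.

```python
import numpy as np


def normalize(h, eps=1e-5):
    """Parameter-free normalization of an (N, C, H, W) array:
    zero mean and unit variance per channel, over batch and spatial axes."""
    mean = h.mean(axis=(0, 2, 3), keepdims=True)
    var = h.var(axis=(0, 2, 3), keepdims=True)
    return (h - mean) / np.sqrt(var + eps)


def one_hot(labels, num_classes):
    """Turn an (N, H, W) integer label map into an (N, K, H, W) one-hot mask."""
    return np.eye(num_classes)[labels].transpose(0, 3, 1, 2)


def spade(h, labels, w_gamma, w_beta):
    """Modulate normalized activations with per-pixel scale and bias.

    w_gamma, w_beta: (K, C) matrices acting as 1x1 convolutions on the
    one-hot mask. This is an assumed simplification for illustration.
    """
    mask = one_hot(labels, w_gamma.shape[0])             # (N, K, H, W)
    gamma = np.einsum("nkhw,kc->nchw", mask, w_gamma)    # (N, C, H, W)
    beta = np.einsum("nkhw,kc->nchw", mask, w_beta)      # (N, C, H, W)
    return gamma * normalize(h) + beta


rng = np.random.default_rng(0)
num_classes, channels = 3, 4
w_gamma = rng.normal(size=(num_classes, channels))
w_beta = rng.normal(size=(num_classes, channels))

# A layout with a single class everywhere produces constant activations
# (modeled here directly as a constant array).
h_uniform = np.full((1, channels, 8, 8), 3.0)
labels_uniform = np.ones((1, 8, 8), dtype=int)

# Plain normalization removes the constant, so the label leaves no trace.
washed = normalize(h_uniform)

# With spatially-adaptive modulation, the class still shows up in the bias.
kept = spade(h_uniform, labels_uniform, w_gamma, w_beta)
```

In this sketch `washed` is zero everywhere regardless of which class filled the layout, while `kept` equals the bias row for class 1 at every pixel, so the label information survives the normalization step. A 1x1 map was chosen only to keep the example short; larger convolution kernels would let neighboring labels influence each pixel's scale and bias.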
