Variation-Aware Semantic Image Synthesis

Mingle Xu, Jaehwan Lee, Sook Yoon, Hyongsuk Kim, Dong Sun Park

TL;DR

This work addresses class-level mode collapse in semantic image synthesis by separating variation into inter- and intra-class components and proposing variation-aware SIS (VASIS). It introduces two lightweight mechanisms—semantic noise and a learnable position code—integrated into conditional normalization to boost intra-class diversity while preserving inter-class variation. Through analyses and experiments on Cityscapes, ADE20k, and COCO-Stuff, VASIS-based variants achieve more natural images and often better FID and mIoU under comparable training conditions, with reduced parameter overhead. The results highlight the importance of intra-class variation for realism in SIS and offer practical, extensible improvements with public code release planned.

Abstract

Semantic image synthesis (SIS) aims to produce photorealistic images that align with a given semantic layout, and it has improved significantly in recent years. Although image-level diversity has been discussed heavily, class-level mode collapse is widespread in current algorithms. We therefore propose a new requirement for SIS to achieve more photorealistic images: variation-awareness, which consists of inter- and intra-class variation. Inter-class variation is the diversity between different semantic classes, while intra-class variation is the diversity within a single class. Through analysis, we find that current algorithms capture inter-class variation, but their intra-class variation is still insufficient. We further introduce two simple methods, semantic noise and position code, to achieve variation-aware semantic image synthesis (VASIS) with higher intra-class variation. We combine our methods with several state-of-the-art algorithms, and the experimental results show that our models generate more natural images and achieve slightly better FIDs and/or mIoUs than their counterparts. Our code and models will be publicly available.
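
To make the distinction between inter- and intra-class variation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a class-wise modulation of the form `gamma[c] * x + beta[c]`, as in class-adaptive conditional normalization, and models semantic noise as per-pixel Gaussian perturbations of those class-wise parameters. The function names `modulate` and `intra_class_std`, the `noise_scale` parameter, and the toy values are all hypothetical. The position code is not covered here.

```python
import random
import statistics


def modulate(labels, features, gamma, beta, noise_scale=0.0, rng=None):
    """Apply class-wise scale/shift to a 2D feature map (nested lists).

    Assumption: semantic noise is modeled as per-pixel Gaussian noise added
    to the per-class gamma and beta; the paper's exact form may differ.
    """
    rng = rng or random.Random()
    out = []
    for label_row, feat_row in zip(labels, features):
        out_row = []
        for c, x in zip(label_row, feat_row):
            g = gamma[c] + noise_scale * rng.gauss(0.0, 1.0)
            b = beta[c] + noise_scale * rng.gauss(0.0, 1.0)
            out_row.append(g * x + b)
        out.append(out_row)
    return out


def intra_class_std(labels, values):
    """Population standard deviation of the values inside each class region."""
    per_class = {}
    for label_row, value_row in zip(labels, values):
        for c, v in zip(label_row, value_row):
            per_class.setdefault(c, []).append(v)
    return {c: statistics.pstdev(vs) for c, vs in per_class.items()}


if __name__ == "__main__":
    labels = [[0, 0, 1], [0, 1, 1]]          # toy semantic layout, two classes
    features = [[0.0] * 3, [0.0] * 3]        # flat (already normalized) input
    gamma = {0: 1.0, 1: 2.0}
    beta = {0: -1.0, 1: 3.0}

    plain = modulate(labels, features, gamma, beta)
    noisy = modulate(labels, features, gamma, beta,
                     noise_scale=0.5, rng=random.Random(0))
    print("without noise:", intra_class_std(labels, plain))
    print("with noise:   ", intra_class_std(labels, noisy))
```

In this toy setting, the noise-free output differs between classes (inter-class variation) but is constant inside each class, so the intra-class standard deviation is zero. Adding the per-pixel noise makes it nonzero.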

Paper Structure

This paper contains 11 sections, 4 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Examples of class-level mode collapse on ADE20k. Similar patterns (red circles) appear when similar semantic layouts are given, while our method eases the class-level mode collapse.
  • Figure 2: (a) Images generated by a model trained on the Cityscapes dataset after replacing zero-padding with reflect-padding. (b) The standard deviation of features in different blocks.
  • Figure 3: Generated samples on Cityscapes and COCO-Stuff. From left to right: input semantic layout, ground truth, SPADE, CLADE-ICPE, OASIS, and VA-OASIS (ours). The first three rows are from Cityscapes, while the remaining rows are from COCO-Stuff.