
Semantic Image Synthesis via Class-Adaptive Cross-Attention

Tomaso Fontanini, Claudio Ferrari, Giuseppe Lisanti, Massimo Bertozzi, Andrea Prati

TL;DR

This work targets semantic image synthesis and addresses the global inconsistencies of SPADE-based conditioning by introducing (CA)$^2$-SIS, a cross-attention-based framework that learns shape-style correlations. It pairs a Multi-Resolution Grouped Style Encoder with a Mask Embedder to feed per-class style codes into a Cross-Attention Generator, reinforced by an attention loss $\mathcal{L}_{att}$ that aligns attention maps with the semantic mask. The approach yields strong reconstruction and editing capabilities, including style transfer and shape manipulation, while improving global consistency and robustness to mask noise; it outperforms SPADE-based methods and remains competitive with StyleGAN-based approaches on several datasets. However, shape transfer can still struggle under strong misalignment, and strong inter-class style correlations may reduce local controllability in some cases. Overall, (CA)$^2$-SIS demonstrates that replacing SPADE with cross-attention provides a versatile, scalable path toward higher-quality, controllable semantic image synthesis with practical editing capabilities.

Abstract

In semantic image synthesis, the state of the art is dominated by methods that use customized variants of the SPatially-Adaptive DE-normalization (SPADE) layers, which allow for good visual generation quality and editing versatility. By design, such layers learn pixel-wise modulation parameters to de-normalize the generator activations based on the semantic class each pixel belongs to. Thus, they tend to overlook global image statistics, ultimately leading to unconvincing local style editing and causing global inconsistencies such as color or illumination distribution shifts. Also, SPADE layers require the semantic segmentation mask for mapping styles in the generator, preventing shape manipulations without manual intervention. In response, we designed a novel architecture where cross-attention layers are used in place of SPADE for learning shape-style correlations and thereby conditioning the image generation process. Our model inherits the versatility of SPADE while obtaining state-of-the-art generation quality, as well as improved global and local style transfer. Code and models are available at https://github.com/TFonta/CA2SIS.
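
To make the abstract's description of SPADE-style conditioning concrete, the following is a minimal sketch of class-dependent de-normalization on a single feature channel. It is a deliberate simplification and not the authors' code: real SPADE layers predict the scale and shift with convolutions over the segmentation mask, whereas here they are plain per-class lookup tables. The function name `spade_like_denorm` and the parameters `gamma` and `beta` are hypothetical. The point it illustrates is that each output pixel depends only on the normalized activation and on that pixel's own class, which is why such layers tend to overlook global image statistics.

```python
import math


def spade_like_denorm(activations, labels, gamma, beta, eps=1e-5):
    """Normalize one feature channel, then modulate each pixel by its class.

    activations: flat list of floats (one channel, one value per pixel)
    labels:      flat list of ints, the semantic class of each pixel
    gamma, beta: per-class scale and shift (hypothetical lookup tables;
                 real SPADE predicts these with convolutions on the mask)
    """
    n = len(activations)
    mean = sum(activations) / n
    var = sum((a - mean) ** 2 for a in activations) / n
    normalized = [(a - mean) / math.sqrt(var + eps) for a in activations]
    # The modulation of a pixel is chosen purely by that pixel's class label.
    return [gamma[c] * x + beta[c] for x, c in zip(normalized, labels)]


if __name__ == "__main__":
    feats = [1.0, 2.0, 3.0, 4.0]
    mask = [0, 0, 1, 1]  # two pixels of class 0, two of class 1
    print(spade_like_denorm(feats, mask, gamma=[1.0, 0.0], beta=[0.0, 5.0]))
```

Because the mask itself selects the modulation, changing an object's shape in this scheme means editing the mask by hand, which is the limitation the abstract points out.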

Paper Structure

This paper contains 19 sections, 7 equations, 15 figures, and 5 tables.

Figures (15)

  • Figure 1: (CA)$^2$-SIS architecture: style codes are extracted using a Multi-Resolution Style Encoder $\mathcal{E}_s$ equipped with grouped convolutions; the Mask Embedder $\mathcal{E}_m$ embeds each of the semantic mask parts into a set of latent codes; finally, these codes are fed to the Cross-Attention Generator $\mathcal{G}$ that is conditioned with the style codes thanks to the cross-attention mechanism. Additionally, the semantic mask is also used in the cross-attention layer to calculate the attention loss $\mathcal{L}_{att}$ in order to push each attention map to follow the mask shape.
  • Figure 2: Cross-Attention layer: The Query $Q$ is derived from the features of the previous residual block, while the Key $K$ and Value $V$ are computed from the multi-resolution style codes. Additionally, an attention loss $\mathcal{L}_{att}$ is calculated between the output of the Softmax and the semantic map.
  • Figure 3: Qualitative comparison between state-of-the-art methods and our architecture on CelebAMask-HQ. Our approach better preserves the color distribution (top row) and illumination coherence (second row).
  • Figure 4: Qualitative comparison between state-of-the-art methods and our architecture on ADE20K and DeepFashion. Our approach better preserves the color distribution (top row) and illumination coherence (second and bottom rows). MaskGAN is not shown since it can be trained only on face images.
  • Figure 5: Comparison between V-INADE, SEAN and (CA)$^2$-SIS when transferring the style of all face parts. Our method generates realistic results even when a specific style (e.g., the opposite ear in the top row, eyeglasses in the middle row, or teeth in the bottom row) is absent from the reference image.
  • ...and 10 more figures
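
The cross-attention layer described in the Figure 2 caption can be sketched roughly as below. This is an illustrative, single-head version in plain Python under several assumptions that are not taken from the paper: the learned projections that would produce $Q$, $K$ and $V$ are omitted (the inputs are used directly), the scores are scaled by $1/\sqrt{d}$ as in standard dot-product attention, and the attention loss is assumed to be a mean cross-entropy between each pixel's attention distribution and its one-hot mask label. The names `cross_attention` and `attention_loss` are hypothetical, and the exact form of $\mathcal{L}_{att}$ used by the authors may differ.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head cross-attention.

    queries: N x d list, one feature vector per pixel (from the generator)
    keys:    C x d list, one vector per semantic class (from the style codes)
    values:  C x d_v list, one vector per semantic class (from the style codes)
    Returns the N x d_v output and the N x C attention map.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    attn = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        attn.append(softmax(scores))  # normalized over classes for each pixel
    d_v = len(values[0])
    out = [
        [sum(a * v[j] for a, v in zip(row, values)) for j in range(d_v)]
        for row in attn
    ]
    return out, attn


def attention_loss(attn, mask_labels, eps=1e-8):
    """Assumed loss form: mean cross-entropy against the one-hot semantic mask."""
    total = sum(math.log(row[c] + eps) for row, c in zip(attn, mask_labels))
    return -total / len(attn)


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0]]  # two pixels
    k = [[4.0, 0.0], [0.0, 4.0]]  # two classes
    v = [[1.0, 0.0], [0.0, 1.0]]
    out, attn = cross_attention(q, k, v)
    print(attn)
    print(attention_loss(attn, [0, 1]))
```

In this toy setup every pixel can attend to the style code of every class, and the loss is small only when each pixel's attention concentrates on the class its mask label assigns to it. That is the sense in which the attention maps are pushed to follow the mask shape.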