SwinStyleformer is a favorable choice for image inversion

Jiawei Mao, Guangyi Zhao, Xuesong Yin, Yuanqi Chang

TL;DR

This work tackles the challenge of image inversion with Transformer architectures by introducing SwinStyleformer, the first pure Transformer-based inversion network. It combines a Swin Transformer backbone with a multi-scale feature pyramid, a map2style network built on learnable-queries attention, and targeted losses (distribution alignment and an inversion discriminator) to bridge the gap between Transformer latent codes and StyleGAN styles. The approach achieves state-of-the-art results in image inversion, editing, face-from-segmentation, and super-resolution across multiple domains, while maintaining efficiency. By addressing Transformer-specific limitations in local detail, multi-scale modeling, and latent-distribution alignment, the method demonstrates that Transformer-based inversion can rival and surpass CNN-based approaches in fidelity and usability.

Abstract

This paper proposes SwinStyleformer, the first inversion network with a pure Transformer structure, which compensates for the shortcomings of CNN-based inversion frameworks by handling long-range dependencies and learning the global structure of objects. In our experiments, an inversion network with a plain Transformer backbone failed to invert images successfully. This failure arises from differences between CNNs and Transformers: compared with convolution, self-attention weights favor image structure and ignore image details; the Transformer lacks multi-scale properties; and the distribution of the latent codes extracted by the Transformer differs from that of the StyleGAN style vectors. To address these differences, we employ the Swin Transformer with a smaller window size as the backbone of SwinStyleformer to enhance the local detail of the inverted image. We also design a Transformer block based on learnable queries. Compared with the self-attention Transformer block, it provides greater adaptability and flexibility, enabling the model to update the attention weights according to the specific task, so the inversion focus is not limited to image structure. To further introduce multi-scale properties, we add multi-scale connections to the extraction of feature maps. These connections give the model a comprehensive understanding of the image and avoid the loss of detail caused by global modeling. Moreover, we propose an inversion discriminator and a distribution alignment loss to minimize the distribution differences. With these designs, SwinStyleformer resolves the Transformer's inversion failure and demonstrates SOTA performance in image inversion and several related vision tasks.
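
To make the idea of a block "based on learnable queries" concrete, the following is a minimal single-head sketch in NumPy. It is not the paper's implementation: the class name `LearnableQueryAttention`, the random initialization, the single head, and the choice of one query are all assumptions made for illustration. The point it shows is that the queries are free parameters that would be updated by training, rather than projections of the input tokens as in ordinary self-attention.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class LearnableQueryAttention:
    """Hypothetical single-head attention with parameter queries.

    In self-attention the queries are computed from the input tokens.
    Here they are a separate parameter matrix, so what the block attends
    to is learned for the task instead of being derived from the input.
    """

    def __init__(self, num_queries, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = dim ** -0.5
        # Assumed initialization; in training these would be updated by gradients.
        self.queries = rng.normal(0.0, scale, (num_queries, dim))
        self.w_k = rng.normal(0.0, scale, (dim, dim))
        self.w_v = rng.normal(0.0, scale, (dim, dim))
        self.dim = dim

    def __call__(self, tokens):
        # tokens: (num_tokens, dim)
        k = tokens @ self.w_k
        v = tokens @ self.w_v
        # attn: (num_queries, num_tokens), each row sums to 1.
        attn = softmax(self.queries @ k.T / np.sqrt(self.dim), axis=-1)
        return attn @ v, attn


# Usage: compress 16 tokens into a single output token with one query.
layer = LearnableQueryAttention(num_queries=1, dim=32)
tokens = np.random.default_rng(1).normal(size=(16, 32))
out, attn = layer(tokens)
```

With one query, the output has a single token regardless of how many tokens come in, which is one plausible way a token sequence could be shortened on the way to a style vector; the paper's actual block may differ.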

Paper Structure (54 sections, 2 equations, 27 figures, 12 tables)

Figures (27)

  • Figure 1: Differences in image structure between convolutional backbone and Transformer backbone inversion results. The green boxes cover the facial outline of the inversion results for the different frameworks. The red boxes represent the size and location of the target's facial outline. The size and location of the facial outline of our results nearly overlap with the target.
  • Figure 2: SwinStyleformer performs well in image inversion and several related tasks. Examples include facial image inversion, image inversion on different domains, image inversion for specific details, image editing, image editing for specific details, image super-resolution, and face from semantic segmentation map.
  • Figure 3: Comparison of pSp with Swin Transformer backbone and SwinStyleformer.
  • Figure 4: SwinStyleformer overall architecture. $A$ denotes the affine transformation corresponding to the latent code. $N_1$ and $N_2$ denote the depths required to reduce the token sequence to lengths of 16 and 1, respectively.
  • Figure 5: Overview of the W-MSA based on learnable queries, with heat maps for standard W-MSA and for W-MSA based on learnable queries. The heat maps visualize the difference between the inversion image and the input image to show the region the inversion focuses on. Our method increases attention to image details while retaining attention to image structure.
  • ...and 22 more figures
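
The W-MSA referred to in Figure 5 computes attention inside non-overlapping windows, so the window size decides how many tokens can attend to each other. The sketch below shows only the window partitioning step, in NumPy. The function name `window_partition`, the 8x8 feature map, and the window sizes 4 and 2 are illustrative assumptions; the window size actually used by SwinStyleformer is not stated in this summary.

```python
import numpy as np


def window_partition(feature_map, window_size):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, window_size * window_size, C),
    where each row group holds the tokens that attend to each other.
    """
    h, w, c = feature_map.shape
    if h % window_size or w % window_size:
        raise ValueError("H and W must be divisible by window_size")
    x = feature_map.reshape(
        h // window_size, window_size, w // window_size, window_size, c
    )
    # Bring the two window-grid axes together, then the two in-window axes.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, window_size * window_size, c)


# Toy 8x8 single-channel map whose values encode position (row * 8 + col).
fmap = np.arange(8 * 8, dtype=np.float32).reshape(8, 8, 1)
large = window_partition(fmap, 4)  # 4 windows of 16 tokens each
small = window_partition(fmap, 2)  # 16 windows of 4 tokens each
```

A smaller window restricts each attention computation to a tighter neighborhood, which matches the abstract's motivation of strengthening local detail.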