
Enhanced Multi-Scale Cross-Attention for Person Image Generation

Hao Tang, Ling Shao, Nicu Sebe, Luc Van Gool

TL;DR

This work tackles photorealistic person image generation conditioned on source imagery and target poses by introducing XingGAN and XingGAN++. The core idea is cross-attention between appearance and shape modalities, implemented through two collaborative branches (SA and AS) that progressively refine appearance and shape representations. XingGAN++ extends this with multi-scale cross-attention, enhanced attention to stabilize correlations, and densely connected co-attention fusion, achieving state-of-the-art GAN-based performance while remaining significantly faster than diffusion models. Extensive experiments on Market-1501 and DeepFashion demonstrate superior quality and robustness, with strong ablations validating each component's contribution and showing broad applicability to related multi-modal generation tasks.

Abstract

In this paper, we propose a novel cross-attention-based generative adversarial network (GAN) for the challenging person image generation task. Cross-attention is a novel and intuitive multi-modal fusion method in which an attention/correlation matrix is calculated between two feature maps of different modalities. Specifically, we propose the novel XingGAN (or CrossingGAN), which consists of two generation branches that capture the person's appearance and shape, respectively. Moreover, we propose two novel cross-attention blocks to effectively transfer and update the person's shape and appearance embeddings for mutual improvement. This has not been considered by any other existing GAN-based image generation work. To further learn the long-range correlations between different person poses at different scales and sub-regions, we propose two novel multi-scale cross-attention blocks. To tackle the issue of independent correlation computations within the cross-attention mechanism leading to noisy and ambiguous attention weights, which hinder performance improvements, we propose a module called enhanced attention (EA). Lastly, we introduce a novel densely connected co-attention module to fuse appearance and shape features at different stages effectively. Extensive experiments on two public datasets demonstrate that the proposed method outperforms current GAN-based methods and performs on par with diffusion-based methods. However, our method is significantly faster than diffusion-based methods in both training and inference.
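
To make the cross-attention idea from the abstract concrete, the following is a minimal NumPy sketch of computing a correlation matrix between two feature maps of different modalities and using it to update one of them. It is an illustration under stated assumptions, not the authors' implementation: the function name `cross_attention`, the `sqrt(C)` scaling, the residual addition, and the omission of learned projection layers are all choices made here for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(appearance, shape):
    """Update an appearance feature map using a shape feature map.

    Both inputs have layout (C, H, W). Hypothetical helper: learned
    convolutions/projections are left out, and the sqrt(C) scaling is
    an assumption borrowed from common attention practice.
    Returns the updated appearance map (C, H, W) and the attention
    matrix (H*W, H*W).
    """
    c, h, w = appearance.shape
    a = appearance.reshape(c, h * w)  # (C, N) appearance features
    s = shape.reshape(c, h * w)       # (C, N) shape features

    # corr[i, j]: similarity of shape position i and appearance position j.
    corr = s.T @ a / np.sqrt(c)       # (N, N)
    attn = softmax(corr, axis=-1)     # each row sums to 1

    # attended[:, i] = sum_j attn[i, j] * a[:, j]
    attended = a @ attn.T             # (C, N)

    # Residual (element-wise) addition keeps the original appearance code.
    return appearance + attended.reshape(c, h, w), attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f_i = rng.standard_normal((8, 4, 4))  # appearance code
    f_p = rng.standard_normal((8, 4, 4))  # shape code
    updated, attn = cross_attention(f_i, f_p)
    print(updated.shape, attn.shape)
```

Swapping the two arguments gives the complementary direction, in which the shape code is updated under the guidance of the appearance code.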

Paper Structure

This paper contains 8 sections, 12 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Overview of the proposed XingGAN++ for the person image generation task. Both the shape-guided appearance-based generation (SA) and the appearance-guided shape-based generation (AS) branches consist of a sequence of interweaved enhanced multi-scale SA and AS blocks. For brevity, we omit both the appearance-guided and shape-guided discriminator in the figure. All these components are trained in an end-to-end fashion so that the SA and AS branches can benefit from each other to generate more shape-consistent and appearance-consistent person images.
  • Figure 2: Structure of the proposed shape-guided appearance-based generation (SA) block (proposed in our conference version tang2020xinggan), which takes the previous appearance code $F_{t-1}^I$ and the previous shape code $F_{t-1}^P$ as input and obtains the appearance code $F_{t}^I$ in a crossed non-local way. The symbols $\oplus$, $\otimes$, and $\textcircled{s}$ denote element-wise addition, element-wise multiplication, and Softmax activation, respectively.
  • Figure 3: Structure of the proposed enhanced multi-scale SA block (proposed in this journal version), which takes the previous appearance code $F_{t-1}^I$ and the previous shape code $F_{t-1}^P$ as input and obtains the appearance code $F_{t}^I$ in a multi-scale crossed non-local way with our enhanced attention (EA) mechanism. The symbols $\oplus$, $\otimes$, and $\textcircled{s}$ denote element-wise addition, element-wise multiplication, and Softmax activation, respectively.
  • Figure 4: Qualitative comparisons on Market-1501. (a) From left to right: Source Image ($I_s$), Source Pose ($P_s$), Target Pose ($P_t$), Target Image ($I_t$), PG2 ma2017pose, VUNet esser2018variational, PoseGAN siarohin2018deformable, XingGAN tang2020xinggan, and XingGAN++ (Ours). (b) From left to right: Source Image ($I_s$), Source Pose ($P_s$), Target Pose ($P_t$), Target Image ($I_t$), PATN zhu2019progressive, PoseStylizer huang2020generating, SPGNet lv2021learning, GFLA ren2020deep, XingGAN tang2020xinggan, and XingGAN++ (Ours).
  • Figure 5: Qualitative comparisons of person pose generation on DeepFashion. (a) From left to right: Source Image ($I_s$), Source Pose ($P_s$), Target Pose ($P_t$), Target Image ($I_t$), PG2 ma2017pose, VUNet esser2018variational, PoseGAN siarohin2018deformable, XingGAN tang2020xinggan, and XingGAN++ (Ours). (b) From left to right: Source Image ($I_s$), Source Pose ($P_s$), Target Pose ($P_t$), Target Image ($I_t$), PATN zhu2019progressive, PoseStylizer huang2020generating, SPGNet lv2021learning, NTED ren2022neural, XingGAN tang2020xinggan, PIDM bhunia2023person, and XingGAN++ (Ours).
  • ...and 3 more figures
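
The captions for Figures 1 and 3 describe interweaved SA and AS blocks that update the appearance code $F_{t}^I$ and the shape code $F_{t}^P$ in a multi-scale crossed non-local way. The sketch below illustrates that data flow in NumPy under several assumptions that are not taken from the paper: average pooling defines the scales, nearest-neighbour upsampling and a plain mean fuse them, both blocks read the previous codes $F_{t-1}^I$ and $F_{t-1}^P$, and no learned layers are used. The enhanced attention (EA) mechanism is deliberately left out, because the captions do not specify how it operates. All function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def avg_pool(x, k):
    """Average-pool a (C, H, W) map by factor k (H and W divisible by k)."""
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))


def upsample(x, k):
    """Nearest-neighbour upsampling of a (C, H, W) map by factor k."""
    return x.repeat(k, axis=1).repeat(k, axis=2)


def attend(target, guide):
    """Crossed non-local attention: guide positions attend over target."""
    c, h, w = target.shape
    t = target.reshape(c, -1)                      # (C, N)
    g = guide.reshape(c, -1)                       # (C, N)
    attn = softmax(g.T @ t / np.sqrt(c), axis=-1)  # (N, N), rows sum to 1
    return (t @ attn.T).reshape(c, h, w)


def multi_scale_block(target, guide, scales=(1, 2)):
    """Update `target` under guidance of `guide` at several scales.

    Assumed fusion: mean of the upsampled per-scale outputs, followed
    by a residual element-wise addition.
    """
    outs = [upsample(attend(avg_pool(target, k), avg_pool(guide, k)), k)
            for k in scales]
    return target + np.mean(outs, axis=0)


def run_branches(f_i, f_p, num_blocks=3):
    """Interweave SA (appearance update) and AS (shape update) blocks."""
    for _ in range(num_blocks):
        new_f_i = multi_scale_block(f_i, f_p)  # SA: shape-guided appearance
        new_f_p = multi_scale_block(f_p, f_i)  # AS: appearance-guided shape
        f_i, f_p = new_f_i, new_f_p
    return f_i, f_p


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f_i = rng.standard_normal((8, 4, 4))
    f_p = rng.standard_normal((8, 4, 4))
    out_i, out_p = run_branches(f_i, f_p)
    print(out_i.shape, out_p.shape)
```

Coarser scales attend over fewer, pooled positions, which is one simple way to expose correlations between larger sub-regions; the actual blocks in the paper may combine scales differently.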