Hierarchical Cross-Attention Network for Virtual Try-On

Hao Tang, Bin Ren, Pingping Wu, Nicu Sebe

TL;DR

HCANet achieves state-of-the-art virtual try-on results, performing well both on quantitative metrics and in subjective evaluations of visual realism.

Abstract

In this paper, we present the Hierarchical Cross-Attention Network (HCANet), a novel solution to the challenges of the virtual try-on task. HCANet consists of two primary stages, geometric matching and try-on, each of which plays a crucial role in delivering realistic try-on results. A key feature of HCANet is a novel Hierarchical Cross-Attention (HCA) block, incorporated into both stages, that captures long-range correlations between the person and clothing modalities. The HCA block enhances the depth and robustness of the network. Its hierarchical design yields a nuanced representation of the interaction between the person and the clothing, capturing the intricate details essential for an authentic virtual try-on experience. In our experiments, HCANet performs strongly on both quantitative metrics and subjective evaluations of visual realism, generating state-of-the-art try-on results that excel in accuracy and realism. This marks a significant step in advancing virtual try-on technologies.

Paper Structure

This paper contains 13 sections, 11 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Overview of the proposed HCANet for the virtual try-on task, which has two key generative stages: geometric matching and try-on. In both stages, we incorporate a novel hierarchical cross-attention (HCA) block designed to capture extensive correlations between the person and clothing modalities. The entire system operates end-to-end, so the two stages enhance each other and yield clothing images that are both shape-consistent and appearance-consistent. Performing cross-attention twice within the HCA block of each stage offers two advantages. First, it integrates information from different modalities, such as RGB images, joints, and binary masks, more comprehensively and at multiple stages of the network, which can improve the model's ability to capture complex relationships between these modalities. Second, it lets the model refine its representations iteratively, facilitating more effective cross-modal interaction and feature fusion, which can lead to better feature representations and ultimately better performance.
  • Figure 2: Structure of the proposed hierarchical cross-attention (HCA) block, which takes both the person representation and the clothing representation as inputs and produces the final fused interactive mutual correlation feature through a hierarchical cross-operation. Specifically, on the left, cross-attention is performed on the person representations $X_p^1, X_p^2$, and $X_p^3$, giving the updated $\hat{X}_p$ as the output. On the right, cross-attention is computed between the clothing representation $X_c$ and the updated person representation $\hat{X}_p$. The symbols $\oplus$, $\otimes$, $\textcircled{s}$, and $\textcircled{c}$ denote element-wise addition, element-wise multiplication, Softmax activation, and channel-wise concatenation, respectively.
  • Figure 3: Qualitative comparisons of the clothes warped by CP-VTON [wang2018toward], CP-VTON+ [minar2020cp], and ACGPN [yang2020towards] in the first geometric matching stage. To the left of the dashed line are same-pair (retry-on) cases, while to the right are the different-pair cases.
  • Figure 4: Qualitative comparison with CP-VTON [wang2018toward], CP-VTON+ [minar2020cp], ACGPN [yang2020towards], PF-AFN [ge2021parser], and FS-VTON [he2022style] in the second try-on stage.
  • Figure 5: Try-on results of clothes with complex textures.
  • ...and 1 more figure
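
The two-level flow described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that feature maps are flattened to token matrices of shape (N, C), that attention is single-head and scaled dot-product with no learned projections, that the three person representations are fused sequentially, and that the clothing branch is concatenated with the updated person feature along the channel axis. The function names `cross_attention` and `hca_block` are hypothetical, and the element-wise multiplication shown in the figure is omitted because the caption does not say where it is applied.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(query_feat, context_feat):
    """Single-head cross-attention with a residual connection.

    query_feat:   (N_q, C) tokens that issue the queries
    context_feat: (N_k, C) tokens that are attended to
    Learned Q/K/V projections are omitted (assumption) to keep the sketch short.
    """
    channels = query_feat.shape[-1]
    scores = query_feat @ context_feat.T / np.sqrt(channels)  # (N_q, N_k)
    weights = softmax(scores, axis=-1)                         # each row sums to 1
    attended = weights @ context_feat                          # (N_q, C)
    return query_feat + attended                               # element-wise addition


def hca_block(person_feats, cloth_feat):
    """Hypothetical two-level wiring of an HCA-style block.

    Level 1: fuse the person representations among themselves -> x_p_hat.
    Level 2: let the clothing feature attend to x_p_hat, then concatenate
             both results along the channel axis.
    """
    x_p_hat = person_feats[0]
    for other in person_feats[1:]:
        x_p_hat = cross_attention(x_p_hat, other)
    cloth_attended = cross_attention(cloth_feat, x_p_hat)
    return np.concatenate([cloth_attended, x_p_hat], axis=-1)


rng = np.random.default_rng(0)
num_tokens, channels = 16, 8  # e.g. a 4x4 feature map flattened to 16 tokens
x_p = [rng.standard_normal((num_tokens, channels)) for _ in range(3)]
x_c = rng.standard_normal((num_tokens, channels))
fused = hca_block(x_p, x_c)
print(fused.shape)  # (16, 16): two C=8 features concatenated channel-wise
```

Because the attention weights form an (N_q, N_k) matrix over all token pairs, every clothing token can draw on every person token regardless of spatial distance, which is the sense in which cross-attention captures long-range correlations between the two modalities.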