Learning Implicit Features with Flow Infused Attention for Realistic Virtual Try-On

Delong Zhang, Qiwei Huang, Yuanliu Liu, Yang Sun, Wei-Shi Zheng, Pengfei Xiong, Wei Zhang

TL;DR

This work tackles realistic virtual try-on by bridging warp-based and learning-based approaches with FIA-VTON, a diffusion-based method built around Flow Infused Attention (FIA). A dense warp flow map $F$ is infused into the diffusion model as implicit guidance: a Flow Guider supplies $F$, a Spatial Guider provides high-level garment semantics via FashionCLIP, and FIA integrates both within the Denoising UNet. The method replaces standard cross-attention with flow- and spatially guided attention, which enables accurate deformation and texture preservation while reducing warping artifacts. On VITON-HD and DressCode, FIA-VTON achieves state-of-the-art results and is robust in in-the-wild scenarios, indicating practical value for realistic virtual try-on systems.

Abstract

Image-based virtual try-on is challenging because the generated image must fit the garment to model images in various poses while preserving the garment's characteristics and details. A popular line of research first warps the garment image to reduce the burden on the generation stage, but this makes the result highly dependent on the performance of the warping module. Other methods without explicit warping often lack sufficient guidance to fit the garment to the model images. In this paper, we propose FIA-VTON, which leverages implicit warp features by adopting a Flow Infused Attention module for virtual try-on. The dense warp flow map is projected as indirect guidance attention that implicitly enhances feature-map warping during generation, which is less sensitive to warp estimation accuracy than an explicit warp of the garment image. To further strengthen the implicit warp guidance, we incorporate high-level spatial attention to complement the dense warp. Experiments on the VITON-HD and DressCode datasets show that FIA-VTON significantly outperforms state-of-the-art methods, demonstrating that it is effective and robust for virtual try-on.
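
To make the contrast in the abstract concrete, here is a minimal sketch of the explicit route: warping a garment image directly with a dense flow map. It is not code from the paper. The function name `warp_with_flow`, the backward-flow convention (each output pixel stores the offset to the source pixel it copies), the border clamping, and the nearest-neighbour sampling are all assumptions chosen for brevity.

```python
import numpy as np


def warp_with_flow(image, flow):
    """Backward-warp `image` (H, W, C) with a dense flow map (H, W, 2).

    Assumed convention (not from the paper): flow[y, x] = (dx, dy) is the
    pixel offset from output location (x, y) to the source pixel it copies.
    Sampling is nearest-neighbour and out-of-range lookups clamp to the border.
    """
    h, w = flow.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, image.shape[1] - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, image.shape[0] - 1)
    # Integer-array indexing gathers one source pixel per output location.
    return image[src_y, src_x]


# Toy "garment": a 4x4 single-channel image whose values are 0..15.
garment = np.arange(16, dtype=np.float32).reshape(4, 4, 1)

# Uniform flow: every output pixel samples the pixel one step to its right.
flow = np.zeros((4, 4, 2), dtype=np.float32)
flow[..., 0] = 1.0

warped = warp_with_flow(garment, flow)
```

In this scheme every output pixel is copied from wherever the flow points, so any error in the estimated flow appears directly in the warped garment. That is the sensitivity the abstract attributes to explicit warping, and the reason the flow is instead used as attention guidance.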

Paper Structure

This paper contains 13 sections, 5 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: (a) Overview of FIA-VTON, illustrating the main components: a VAE Encoder and Decoder pair, a Garment Net, and a Denoising UNet. The model takes a garment image, a target person image, and mask- and pose-based control images as inputs. We adopt a Flow Guider to generate a dense flow map and a Spatial Guider to extract global garment characteristics. These features are processed in a Flow Infused Attention module, which interacts with the Garment Net and Denoising UNet to generate the final try-on output. (b) Illustration of the Flow Infused Attention module. The flow is projected into the garment and model feature spaces and then fused using a cross-attention mechanism to produce flow-guided features. The high-level spatial feature is then integrated by decoupled cross-attention, which captures garment consistency, details, and high-level characteristics uniformly.
  • Figure 2: Qualitative comparison on the VITON-HD dataset [choi2021viton]. Examples generated by VITON-HD, HR-VTON, GP-VTON, LaDI-VTON, DCI-VTON, StableVITON, $D^4$-VTON, and our model. Zoom in for a better view.
  • Figure 3: Qualitative comparison on the DressCode dataset [morelli2022dress]. FIA-VTON demonstrates a distinct advantage in handling complex textures and drastic deformations. Please zoom in for more details.
  • Figure 4: Ablation study on our FIA-VTON.
  • Figure 5: Ablation study on Flow Infused Attention (FIA).
  • ...and 1 more figure
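
The Flow Infused Attention description in the Figure 1(b) caption can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: the caption does not say how the flow is projected or fused, so the linear projections, the additive injection of projected flow into the person and garment tokens, the equal token counts, the `scale` weight, and every name in the code (`flow_infused_attention`, `flow_to_person`, `w_k_s`, and so on) are assumptions. The decoupled step is written as the sum of two attention outputs that share one query.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention on (tokens, dim) arrays."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def flow_infused_attention(person, garment, flow, spatial, p, scale=1.0):
    """Hypothetical FIA sketch.

    person, garment: (N, d) feature tokens; flow: (N, 2) flattened flow map;
    spatial: (T, d_s) high-level garment tokens; p: dict of weight matrices.
    """
    # Assumption: flow is linearly projected into each feature space and added.
    q = (person + flow @ p["flow_to_person"]) @ p["w_q"]
    g = garment + flow @ p["flow_to_garment"]
    # Cross-attention from person queries to flow-augmented garment tokens.
    flow_guided = attention(q, g @ p["w_k"], g @ p["w_v"])
    # Decoupled branch: same queries, separate keys/values for spatial tokens.
    spatial_out = attention(q, spatial @ p["w_k_s"], spatial @ p["w_v_s"])
    return flow_guided + scale * spatial_out


rng = np.random.default_rng(0)
n, t, d, d_s = 16, 4, 8, 6  # toy sizes, chosen arbitrarily
shapes = {
    "flow_to_person": (2, d), "flow_to_garment": (2, d),
    "w_q": (d, d), "w_k": (d, d), "w_v": (d, d),
    "w_k_s": (d_s, d), "w_v_s": (d_s, d),
}
p = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

person = rng.normal(size=(n, d))
garment = rng.normal(size=(n, d))
flow = rng.normal(size=(n, 2))
spatial = rng.normal(size=(t, d_s))

out = flow_infused_attention(person, garment, flow, spatial, p)
```

Setting `scale=0.0` reduces the sketch to the flow-guided branch alone, which is one way to isolate the contribution of the high-level spatial tokens in an experiment of this kind.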