PFDM: Parser-Free Virtual Try-on via Diffusion Model

Yunfang Niu, Dong Yi, Lingxiang Wu, Zhiwei Liu, Pengxiang Cai, Jinqiao Wang

TL;DR

PFDM addresses the dependency on parsing in virtual try-on by introducing a diffusion-based, parser-free framework that implicitly warps garments onto a target person in latent space. The method uses pseudo-inputs generated from multiple parser-based models and a Garment Fusion Attention module to fuse garment and person features, guided by classifier-free diffusion sampling. It achieves state-of-the-art performance on high-resolution VITON-HD data, producing high-fidelity textures and robust results under occlusion and misalignment. This work reduces reliance on parsing data, enabling high-quality, one-step virtual try-on suitable for real-world e-commerce and metaverse applications.
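
For readers unfamiliar with classifier-free sampling, its standard form combines a conditional and an unconditional noise prediction at every denoising step. The symbols below are assumptions for illustration and are not taken from the paper: $z_t$ is the noisy latent at step $t$, $c$ is the conditioning (here presumably the garment and person inputs), $\varnothing$ is the empty condition, and $s$ is the guidance scale.

$$
\hat{\epsilon}_\theta(z_t, c) = \epsilon_\theta(z_t, \varnothing) + s \left( \epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \varnothing) \right)
$$

With $s = 1$ this reduces to the plain conditional prediction; $s > 1$ pushes the sample further toward the condition. Which scale PFDM uses is not stated in this summary.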

Abstract

Virtual try-on can significantly improve the garment shopping experience in both online and in-store scenarios, and has attracted broad interest in computer vision. However, to achieve high-fidelity try-on performance, most state-of-the-art methods still rely on accurate segmentation masks, which are often produced by near-perfect parsers or manual labeling. To overcome this bottleneck, we propose a parser-free virtual try-on method based on the diffusion model (PFDM). Given two images, PFDM can "wear" garments on the target person seamlessly through implicit warping, without any other information. To train the model effectively, we synthesize many pseudo-images and construct sample pairs by dressing persons in various garments. Supervised by this large-scale expanded dataset, we fuse the person and garment features using a proposed Garment Fusion Attention (GFA) mechanism. Experiments demonstrate that PFDM can successfully handle complex cases, synthesize high-fidelity images, and outperform both state-of-the-art parser-free and parser-based models.

Paper Structure

This paper contains 9 sections, 8 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Architecture of the proposed parser-free try-on method, which contains two pipelines. First, we generate pseudo-images $\widetilde{P}$ of the person wearing another, unpaired garment using a model hub. Then, a well-designed diffusion model is trained to generate the try-on result $\hat{P}$.
  • Figure 2: The Garment Fusion Attention module. Given $f_i^P$ and $f_j^G$, the enhanced attention fuses the inner features of the person and the garment into the output $f_{i+1}^P$.
  • Figure 3: Visual comparison of PFDM and other competitive methods. From left to right: the reference person and the in-shop garment, then images generated by HR-VITON [lee2022high], LaDI-VTON [morelli2023ladi], and PFDM (ours).
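
The Figure 2 caption describes fusing person features $f_i^P$ with garment features $f_j^G$ through attention. The summary does not specify what makes the attention "enhanced", so the following is only a minimal sketch of plain cross-attention with a residual connection, not the authors' GFA module. The function name `cross_attention_fuse`, the projection matrices, and all shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(f_person, f_garment, w_q, w_k, w_v):
    """Fuse garment features into person features (illustrative only).

    f_person:  (N_p, d) person tokens, standing in for f_i^P
    f_garment: (N_g, d) garment tokens, standing in for f_j^G
    Returns the fused person tokens (a stand-in for f_{i+1}^P)
    and the attention weights.
    """
    q = f_person @ w_q   # queries come from the person branch
    k = f_garment @ w_k  # keys come from the garment branch
    v = f_garment @ w_v  # values come from the garment branch
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)  # each person token attends over garment tokens
    fused = f_person + attn @ v      # residual connection (an assumption)
    return fused, attn


rng = np.random.default_rng(0)
d = 8
f_p = rng.normal(size=(16, d))  # 16 hypothetical person tokens
f_g = rng.normal(size=(12, d))  # 12 hypothetical garment tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

f_next, attn = cross_attention_fuse(f_p, f_g, w_q, w_k, w_v)
print(f_next.shape, attn.shape)
```

Using the person tokens as queries means every spatial location on the person can pull texture from any location on the garment, which is one plausible way to realize the implicit warping the abstract mentions without an explicit flow field or parsing mask.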