Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On

Siqi Wan, Jingwen Chen, Yingwei Pan, Ting Yao, Tao Mei

TL;DR

This work tackles virtual try-on by taming diffusion stochasticity through explicit visual correspondence. It introduces SPM-Diff, a diffusion model built on Semantic Point Matching (SPM): semantic points are sampled on the in-shop garment, mapped to the target person via local flow warping, and augmented with the person's depth/normal maps into 3D-aware cues before being injected into a dual-branch diffusion model. A point-focused diffusion loss further emphasizes accurate reconstruction at the semantic points, yielding sharper garment details and better shape preservation. Empirical results on VITON-HD, DressCode, and cross-dataset tests show state-of-the-art garment-detail fidelity and robust generalization, which matters for high-fidelity VTON in e-commerce and related applications.
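The point-mapping step described above can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function names `warp_points` and `lift_to_3d_cues`, the dense `(H, W, 2)` flow layout with `(dx, dy)` displacements, and the cue layout `[x, y, depth, nx, ny, nz]` are all hypothetical choices made here for clarity.

```python
import numpy as np

def warp_points(points_xy, flow):
    """Move garment points into person coordinates using a flow field.

    points_xy: (N, 2) integer array of (x, y) pixel coordinates on the garment.
    flow:      (H, W, 2) array; flow[y, x] holds a (dx, dy) displacement.
               This layout is an assumption of the sketch.
    """
    h, w = flow.shape[:2]
    x, y = points_xy[:, 0], points_xy[:, 1]
    moved = np.rint(points_xy + flow[y, x]).astype(int)
    # Keep warped points inside the image bounds.
    moved[:, 0] = np.clip(moved[:, 0], 0, w - 1)
    moved[:, 1] = np.clip(moved[:, 1], 0, h - 1)
    return moved

def lift_to_3d_cues(person_xy, depth, normal):
    """Attach depth and surface normal to each 2D point.

    depth:  (H, W) depth map of the target person.
    normal: (H, W, 3) normal map of the target person.
    Returns an (N, 6) array laid out as [x, y, depth, nx, ny, nz].
    """
    x, y = person_xy[:, 0], person_xy[:, 1]
    return np.concatenate(
        [person_xy.astype(float), depth[y, x][:, None], normal[y, x]], axis=1
    )

# Toy data: a constant flow that shifts every point by (+1, +2).
flow = np.zeros((8, 8, 2))
flow[..., 0] = 1.0
flow[..., 1] = 2.0
depth = np.arange(64, dtype=float).reshape(8, 8)
normal = np.zeros((8, 8, 3))
normal[..., 2] = 1.0  # every normal faces the camera

garment_points = np.array([[2, 3], [7, 7]])
person_points = warp_points(garment_points, flow)
cues = lift_to_3d_cues(person_points, depth, normal)
```

In this toy setup the first point moves from (2, 3) to (3, 5), while the second would leave the image and is clipped back to (7, 7). A real pipeline would estimate the flow, depth, and normal maps with dedicated models rather than constructing them by hand.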

Abstract

Diffusion models have shown preliminary success in the virtual try-on (VTON) task. The typical dual-branch architecture comprises two UNets, one for implicit garment deformation and one for synthesized image generation, and has emerged as the standard recipe for VTON. Nevertheless, preserving the shape and every detail of the given garment remains challenging due to the intrinsic stochasticity of diffusion models. To alleviate this issue, we propose to explicitly capitalize on visual correspondence as a prior to tame the diffusion process, instead of simply feeding the whole garment into the UNet as the appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in the garment to the ones over the target person through local flow warping. These 2D points are then augmented into 3D-aware cues with the depth/normal map of the target person. The correspondence mimics the way clothing is put on the human body, and the 3D-aware cues act as semantic point matching to supervise diffusion model training. A point-focused diffusion loss is further devised to take full advantage of semantic point matching. Extensive experiments demonstrate the strong garment detail preservation of our approach, evidenced by state-of-the-art VTON performance on both the VITON-HD and DressCode datasets. Code is publicly available at: https://github.com/HiDream-ai/SPM-Diff.
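The abstract does not give the form of the point-focused diffusion loss, so the following is only one plausible reading, sketched in NumPy: a standard noise-prediction mean squared error plus an extra term evaluated only at the semantic point locations. The function name `point_focused_loss`, the additive form, and the `point_weight` parameter are assumptions of this sketch and may differ from the paper's actual definition.

```python
import numpy as np

def point_focused_loss(noise, noise_pred, points_xy, point_weight=1.0):
    """Noise-prediction MSE with an extra penalty at semantic points.

    noise, noise_pred: (C, H, W) arrays (true and predicted noise).
    points_xy:         (N, 2) integer (x, y) locations at the same resolution.
    point_weight:      hypothetical weighting of the point term (assumption).
    """
    sq_err = (noise - noise_pred) ** 2
    base = sq_err.mean()                 # error over the whole map
    x, y = points_xy[:, 0], points_xy[:, 1]
    at_points = sq_err[:, y, x].mean()   # error restricted to the points, shape (C, N)
    return base + point_weight * at_points

# Toy data: a single wrong prediction at (x=2, y=1).
noise = np.zeros((1, 4, 4))
noise_pred = np.zeros((1, 4, 4))
noise_pred[0, 1, 2] = 2.0

loss_on_point = point_focused_loss(noise, noise_pred, np.array([[2, 1]]), point_weight=0.5)
loss_off_point = point_focused_loss(noise, noise_pred, np.array([[0, 0]]), point_weight=0.5)
```

The same prediction error costs more when it coincides with a semantic point (`loss_on_point`) than when it falls elsewhere (`loss_off_point`), which is the behavior the abstract attributes to the point-focused loss.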

Paper Structure

This paper contains 20 sections, 8 equations, 17 figures, 9 tables.

Figures (17)

  • Figure 1: Illustration of a given target person and (a) the in-shop garment with semantic points. Existing GAN-based methods (e.g., (b) GP-VTON) and diffusion-based approaches (e.g., (c) Stable-VTON and (d) OOTDiffusion) often struggle with complex garment texture details and challenging human poses, resulting in a range of artifacts and missing texture details. In contrast, (e) our SPM-Diff effectively alleviates these limitations and produces higher-quality results with better-aligned semantic points, giving strong visual correspondence and thereby preserving garment detail and shape.
  • Figure 2: The overall framework of our SPM-Diff. (a) Illustration of our semantic point matching (SPM). In SPM, a set of semantic points on the garment is first sampled and matched to the corresponding points on the target person via local flow warping. These 2D cues are then augmented into 3D-aware cues with the depth/normal map, which act as semantic point matching to supervise the diffusion model. (b) The dual-branch framework includes Garm-UNet and Main-UNet for garment feature learning and image generation, respectively. Note that Main-UNet is upgraded with SPM for high-fidelity synthesis in our SPM-Diff. (c) Visualization of garment-to-person point correspondence.
  • Figure 3: Qualitative results on the VITON-HD dataset.
  • Figure 4: User study on 100 garment-person pairs randomly sampled from VITON-HD dataset.
  • Figure 5: Accuracy of visual correspondence between in-shop garment and synthesized person image.
  • ...and 12 more figures