D$^4$-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On

Zhaotong Yang, Zicheng Jiang, Xinzhe Li, Huiyu Zhou, Junyu Dong, Huaidong Zhang, Yong Du

TL;DR

This work tackles semantic inconsistencies in image-based virtual try-on and learning ambiguities in diffusion-based synthesis. It introduces Dynamic Semantics Disentangling Modules (DSDMs) to learn semantically disentangled local garment flows and a Differential Diffusion framework with a Differential Information Tracking Path (DITP) to decouple inpainting and denoising in the synthesis stage. The approach yields state-of-the-art results on VITON-HD and DressCode across paired and unpaired settings, with clear improvements in perceptual quality and texture fidelity, validated by quantitative metrics and qualitative analysis. By reducing optimization conflicts and preserving garment textures, D$^4$-VTON offers a robust, low-overhead path toward realistic, semantically consistent virtual try-on suitable for online shopping and related applications.

Abstract

In this paper, we introduce D$^4$-VTON, an innovative solution for image-based virtual try-on. We address challenges from previous studies, such as semantic inconsistencies before and after garment warping and reliance on static, annotation-driven clothing parsers. We also tackle the difficulty that diffusion-based VTON models face when handling inpainting and denoising simultaneously. Our approach relies on two key technologies. First, Dynamic Semantics Disentangling Modules (DSDMs) extract abstract semantic information from garments to create distinct local flows, making garment warping more precise in a self-discovered manner. Second, by integrating a Differential Information Tracking Path (DITP), we establish a novel diffusion-based VTON paradigm. This path captures differential information between incomplete try-on inputs and their complete versions, enabling the network to handle multiple degradations independently, thereby minimizing learning ambiguities and achieving realistic results with minimal overhead. Extensive experiments show that D$^4$-VTON significantly outperforms existing methods in both quantitative metrics and qualitative evaluations, confirming its ability to generate realistic images and ensure semantic consistency.

Paper Structure

This paper contains 8 sections, 14 equations, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: D$^4$-VTON excels with two innovations: i) Dynamic Semantics Disentangling Modules aggregate abstract semantic information for precise garment warping. ii) A diffusion-based framework integrating a Differential Information Tracking Path reduces learning ambiguities, enhancing accuracy in fitting garment types and body shapes.
  • Figure 2: Overall pipeline of D$^4$-VTON. The deformation network takes the garment image $G$ and conditional triplet $O$ to generate local flows via DSDMs. Utilizing the final flows, we warp the garment features $l_0$ and decode them into the warped garment $G_\omega$, which is then combined with the clothing-agnostic image to create $I^\prime$ as the input for the synthesis network. By tracking differential information via DITP, we separately perform inpainting and denoising on the latents to produce the try-on result $\hat{I}$.
  • Figure 3: Illustration of the Dynamic Semantics Selector, the Group Warping Block, and the training pipeline of the differential diffusion based synthesis network.
  • Figure 4: Qualitative comparison on the VITON-HD dataset (Choi et al., 2021). Please zoom in for a better view.
  • Figure 5: Qualitative comparison on DressCode (Morelli et al., 2022). Categories in each row from top to bottom are upper, lower, and dresses, respectively. Please zoom in for a better view.
  • ...and 4 more figures
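
The Figure 2 caption describes warping the garment with predicted flows and combining the result with the clothing-agnostic image to form the synthesis input $I^\prime$. The sketch below illustrates that idea only in a heavily simplified form. It assumes a single dense flow with integer offsets, nearest-neighbor backward sampling, and pixel-level (not feature-level) warping, whereas the paper uses multiple local flows from DSDMs and decodes warped features. The function names `warp_with_flow` and `composite`, the mask, and the toy data are hypothetical and not taken from the authors' code.

```python
# Illustrative sketch only; names and data are hypothetical, not the paper's code.

def warp_with_flow(image, flow):
    """Backward-warp a 2D grid with a dense flow field.

    output[y][x] = image[y + dy][x + dx], where (dy, dx) = flow[y][x].
    Offsets are assumed to be integers (nearest-neighbor sampling).
    Samples that fall outside the image are set to 0.
    """
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]
            sy, sx = y + dy, x + dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = image[sy][sx]
    return out


def composite(warped, agnostic, mask):
    """Use the warped garment where mask is 1, else keep the agnostic image."""
    return [
        [wg if m else ag for wg, ag, m in zip(w_row, a_row, m_row)]
        for w_row, a_row, m_row in zip(warped, agnostic, mask)
    ]


# Toy "garment" G as a 3x3 grid of intensities.
garment = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]

# Each output pixel samples one column to its left, so content shifts right.
flow = [[(0, -1)] * 3 for _ in range(3)]
warped = warp_with_flow(garment, flow)

# Toy clothing-agnostic image (-1 everywhere) and an assumed garment-region mask.
agnostic = [[-1] * 3 for _ in range(3)]
mask = [[0, 1, 1] for _ in range(3)]

synthesis_input = composite(warped, agnostic, mask)
print(synthesis_input)
```

In a real pipeline the sampling would typically be bilinear and differentiable so that the flows can be learned end to end; the nearest-neighbor version here is chosen only to keep the example dependency-free.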