Improving Diffusion Models for Authentic Virtual Try-on in the Wild

Yisol Choi; Sangkyung Kwak; Kyungmin Lee; Hyungwon Choi; Jinwoo Shin

TL;DR

This work addresses the challenge of authentic image-based virtual try-on under in-the-wild conditions by introducing IDM-VTON, a diffusion-model-based framework that conditions garment information through two modules: an Image Prompt Adapter for high-level garment semantics and GarmentNet for low-level garment details. It also leverages detailed garment captions and a customizable fine-tuning strategy (decoder-focused) to adapt to new garment-person pairs, significantly improving garment fidelity and realism compared with GAN-based and prior diffusion-based methods. The approach achieves state-of-the-art results on public datasets (VITON-HD, DressCode) and demonstrates strong generalization in-the-wild, with customization offering substantial gains in identity preservation and authenticity. Overall, IDM-VTON advances practical VTON applications by marrying rich garment conditioning with diffusion priors, while also acknowledging potential misuse and remaining limitations in handling certain skin attributes.

Abstract

This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment, given a pair of images depicting the person and the garment, respectively. Previous works adapt existing exemplar-based inpainting diffusion models for virtual try-on to improve the naturalness of the generated visuals compared to other methods (e.g., GAN-based), but they fail to preserve the identity of the garments. To overcome this limitation, we propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images. Our method, coined IDM-VTON, uses two different modules to encode the semantics of the garment image; given the base UNet of the diffusion model, 1) the high-level semantics extracted from a visual encoder are fused into the cross-attention layer, and 2) the low-level features extracted from a parallel UNet are fused into the self-attention layer. In addition, we provide detailed textual prompts for both garment and person images to enhance the authenticity of the generated visuals. Finally, we present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity. Our experimental results show that our method outperforms previous approaches (both diffusion-based and GAN-based) in preserving garment details and generating authentic virtual try-on images, both qualitatively and quantitatively. Furthermore, the proposed customization method demonstrates its effectiveness in a real-world scenario. More visualizations are available on our project page: https://idm-vton.github.io
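The two fusion paths described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `fused_block`, the single-head attention, the random projection matrices standing in for learned weights, and the shared feature dimension are all hypothetical. The detail that only the first half of the self-attention output (the TryonNet tokens) is kept follows the Figure 2 caption; merging text and image-prompt tokens into one `context` array is a simplification.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention over (tokens, dim) arrays."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def fused_block(tryon_feats, garment_feats, context, rng):
    """Hypothetical attention block: self-attention fusion, then cross-attention.

    tryon_feats:   (n, d) intermediate features from the main UNet
    garment_feats: (m, d) intermediate features from the parallel garment UNet
    context:       (c, d) text and image-prompt tokens (simplified to one array)
    """
    n, d = tryon_feats.shape

    def proj():
        # Random projection standing in for a learned weight matrix (assumption).
        return rng.standard_normal((d, d)) / np.sqrt(d)

    # Low-level path: concatenate both token sets and run self-attention jointly.
    joint = np.concatenate([tryon_feats, garment_feats], axis=0)
    out = attention(joint @ proj(), joint @ proj(), joint @ proj())

    # Keep only the first half, i.e. the tokens that came from the main UNet.
    h = out[:n]

    # High-level path: the kept tokens query the conditioning tokens.
    return attention(h @ proj(), context @ proj(), context @ proj())


rng = np.random.default_rng(0)
tryon = rng.standard_normal((16, 8))
garment = rng.standard_normal((16, 8))
context = rng.standard_normal((5, 8))
fused = fused_block(tryon, garment, context, rng)
print(fused.shape)  # (16, 8)
```

The output has the same token count as the main-UNet features, so the block can replace a standard attention block without changing downstream shapes.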

Paper Structure (39 sections, 4 equations, 18 figures, 7 tables)


Introduction
Related Works
Image-based virtual try-on.
Adding conditional control to diffusion models.
Customizing diffusion models.
Method
Backgrounds on Diffusion Models
Text-to-image (T2I) diffusion models.
Image prompt adapter [ye2023ip].
Proposed Method
TryonNet.
Image prompt adapter.
GarmentNet.
Detailed captioning of garments.
Customization of IDM-VTON.
...and 24 more sections

Figures (18)

Figure 1: Virtual try-on images generated by our IDM-VTON on VITON-HD [choi2021viton] (top row, first and second columns), DressCode [morelli2022dress] (top row, third and fourth columns), and our collected In-the-Wild dataset (bottom row). Best viewed zoomed in, on a color monitor.
Figure 2: Overview of IDM-VTON. We illustrate the proposed model architecture and the details of its attention modules. (Left) Our model consists of 1) TryonNet, the main UNet that processes the person image, 2) an image prompt adapter (IP-Adapter) [ye2023ip] that encodes high-level semantics of the garment image $\mathbf{x}_g$, and 3) GarmentNet, which encodes low-level features of $\mathbf{x}_g$. As input to the UNet, we concatenate the noised latents $\mathbf{x}_t$ of the person-image latents $\mathcal{E}(\mathbf{x}_p)$ with the segmentation mask $\mathbf{m}$, the masked image $\mathcal{E}(\mathbf{x}_m)$, and the DensePose [guler2018densepose] map $\mathcal{E}(\mathbf{x}_{\textrm{pose}})$. We provide a detailed caption of the garment (e.g., [V]: "short sleeve round neck t-shirts"), which is then used in the input prompts of GarmentNet (e.g., "A photo of [V]") and TryonNet (e.g., "Model is wearing [V]"). (Right) The intermediate features of TryonNet and GarmentNet are concatenated and passed to the self-attention layer, and we use the first half of the output (i.e., the part from TryonNet). We then fuse the output with features from the text encoder and the IP-Adapter through the cross-attention layer. We fine-tune the TryonNet and IP-Adapter modules and freeze the other components.
Figure 3: Comparison of the datasets used in our experiments. For evaluation, we test on (a) public datasets, including VITON-HD [choi2021viton] and DressCode [morelli2022dress], and (b) the In-the-Wild dataset, which we collected internally from a real e-commerce setup. We remark that the In-the-Wild dataset contains more intricate patterns and logos in the garment images, and more diverse backgrounds and poses in the person images.
Figure 4: Qualitative results on the VITON-HD and DressCode datasets. We show virtual try-on images generated by IDM-VTON (ours) compared with other methods on the (a) VITON-HD [choi2021viton] and (b) DressCode (upper body) [morelli2022dress] test datasets. IDM-VTON outperforms the others in generating authentic images and preserving fine-grained garment details. Best viewed zoomed in, on a color monitor.
Figure 5: Qualitative comparisons on the In-the-Wild dataset. We show virtual try-on images generated on the In-the-Wild dataset by IDM-VTON (ours) compared with other methods. IDM-VTON outperforms the other methods in generating authentic images and preserving fine-grained garment details. In particular, customizing IDM-VTON (i.e., IDM-VTON$^\ddag$) significantly enhances image quality and garment fidelity. When customization is applied to StableVITON (i.e., StableVITON$^\ddag$), the improvements are marginal compared to ours. Best viewed zoomed in, on a color monitor.
...and 13 more figures
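The input assembly and prompt templates described in the Figure 2 caption can be sketched as follows. This is an illustration only: the helper names `build_prompts` and `build_unet_input`, the channel-first array layout, the 4-channel latents, and the assumption that the mask has already been resized to the latent resolution are all hypothetical and not stated in the text above. Only the list of concatenated inputs and the two prompt templates come from the caption.

```python
import numpy as np


def build_prompts(garment_caption):
    """Fill the two prompt templates from the caption with the garment text [V]."""
    return {
        "garmentnet": f"A photo of {garment_caption}",
        "tryonnet": f"Model is wearing {garment_caption}",
    }


def build_unet_input(noised_latents, mask, masked_image_latents, pose_latents):
    """Concatenate the conditioning inputs along the channel axis.

    All arrays are (channels, height, width) at the same spatial size
    (an assumption; the mask is taken to be resized to latent resolution).
    """
    return np.concatenate(
        [noised_latents, mask, masked_image_latents, pose_latents], axis=0
    )


# Hypothetical shapes: 4-channel latents and a 1-channel mask on an 8x6 grid.
h, w = 8, 6
x_t = np.zeros((4, h, w))          # noised latents of the person image
m = np.ones((1, h, w))             # segmentation mask
x_masked = np.zeros((4, h, w))     # latents of the masked person image
x_pose = np.zeros((4, h, w))       # latents of the DensePose map

unet_input = build_unet_input(x_t, m, x_masked, x_pose)
prompts = build_prompts("short sleeve round neck t-shirts")
print(unet_input.shape)            # (13, 8, 6)
print(prompts["tryonnet"])         # Model is wearing short sleeve round neck t-shirts
```

Under these assumed shapes the UNet's first convolution would need to accept 13 input channels instead of the usual latent channel count.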
