OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on

Yuhao Xu, Tao Gu, Weifeng Chen, Chengcai Chen

TL;DR

OOTDiffusion tackles the image-based virtual try-on problem by eliminating the need for explicit garment warping and instead learning garment details in a dedicated outfitting UNet that is fused into a pretrained latent diffusion model's denoising network. The method introduces outfitting fusion within self-attention layers and a training-time outfitting dropout, enabling classifier-free guidance for controllable garment influence. Finetuned on high-resolution VITON-HD and Dress Code data, OOTDiffusion delivers superior realism and garment detail preservation, with strong cross-dataset generalization and robust qualitative and quantitative performance. The approach offers practical potential for e-commerce VTON and is accompanied by publicly released code.

Abstract

We present OOTDiffusion, a novel network architecture for realistic and controllable image-based virtual try-on (VTON). We leverage the power of pretrained latent diffusion models, designing an outfitting UNet to learn the garment detail features. Without a redundant warping process, the garment features are precisely aligned with the target human body via the proposed outfitting fusion in the self-attention layers of the denoising UNet. In order to further enhance the controllability, we introduce outfitting dropout to the training process, which enables us to adjust the strength of the garment features through classifier-free guidance. Our comprehensive experiments on the VITON-HD and Dress Code datasets demonstrate that OOTDiffusion efficiently generates high-quality try-on results for arbitrary human and garment images, which outperforms other VTON methods in both realism and controllability, indicating an impressive breakthrough in virtual try-on. Our source code is available at https://github.com/levihsu/OOTDiffusion.
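
The interplay between outfitting dropout and classifier-free guidance described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the `toy_denoiser` stand-in, the dropout probability of 0.1, and the use of an all-zero latent as the "no garment" condition are assumptions made for illustration. The guidance step uses the standard classifier-free guidance combination of an unconditional and a conditional noise prediction.

```python
import numpy as np


def outfitting_dropout(garment_latent, rng, drop_prob=0.1):
    """Training-time sketch: occasionally drop the garment condition.

    Assumption: the dropped condition is an all-zero latent and
    drop_prob=0.1 is only an illustrative value.
    """
    if rng.random() < drop_prob:
        return np.zeros_like(garment_latent)
    return garment_latent


def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance combination.

    guidance_scale = 0 ignores the garment, 1 reproduces the conditional
    prediction, and larger values push further toward the garment.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


def toy_denoiser(noisy_latent, garment_latent):
    # Hypothetical stand-in for the denoising UNet; a real model is a
    # learned network, not a fixed linear map.
    return 0.5 * noisy_latent + 0.1 * garment_latent


rng = np.random.default_rng(0)
noisy = rng.standard_normal((4, 8, 6))    # toy latent shape (C, H, W)
garment = rng.standard_normal((4, 8, 6))

# Inference: one pass with the garment, one with the null (zero) garment.
eps_cond = toy_denoiser(noisy, garment)
eps_uncond = toy_denoiser(noisy, np.zeros_like(garment))
eps = guided_noise(eps_uncond, eps_cond, guidance_scale=2.0)
```

Training with the garment condition occasionally removed is what makes the unconditional prediction meaningful at inference time, so the guidance scale can act as a single knob for garment strength.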

Paper Structure (24 sections, 4 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Outfitted images ($1024\times 768$) generated by our OOTDiffusion trained on the VITON-HD [choi2021viton] (1st row; supporting upper-body garments) and Dress Code [morelli2022dress] (2nd row; supporting upper-body garments, lower-body garments and dresses) datasets, with various input human and garment images. Please zoom in for more details.
  • Figure 2: Overview of our proposed OOTDiffusion model. On the left, the garment image is encoded into the latent space and fed into the outfitting UNet in a single-step process. Along with the auxiliary conditioning input generated by CLIP encoders, the garment features are incorporated into the denoising UNet via outfitting fusion. Outfitting dropout is applied to the garment latents during training to enable classifier-free guidance. On the right, the input human image is masked with respect to the target region and concatenated with Gaussian noise as the input to the denoising UNet for multiple sampling steps. After denoising, the feature map is decoded back into the image space as our try-on result.
  • Figure 3: Visualization of the attention maps with respect to the human body (1st row) and garment features (2nd row) aligned by our outfitting fusion.
  • Figure 4: Qualitative comparison of outfitted images generated by OOTDiffusion models trained without/with outfitting dropout and using different values of the guidance scale $s_\mathbf{g}$. Please zoom in for more details.
  • Figure 5: Qualitative comparison on the VITON-HD dataset [choi2021viton] (half-body models with upper-body garments). Please zoom in for more details.
  • ...and 2 more figures
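
The outfitting fusion mentioned in the Figure 2 and Figure 3 captions can be sketched as a self-attention layer that sees both human and garment tokens. The sketch below is a minimal single-head NumPy version under stated assumptions, not the authors' code: it assumes the garment features are concatenated with the human features along the token (spatial) axis before attention and that only the human half of the output is kept. The names `outfitting_fusion`, `self_attention`, and the projection matrices `w_q`, `w_k`, `w_v` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over (N, D) tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def outfitting_fusion(human_tokens, garment_tokens, w_q, w_k, w_v):
    """Assumed fusion: attend over human and garment tokens jointly,
    then keep only the rows that belong to the human feature map."""
    fused = np.concatenate([human_tokens, garment_tokens], axis=0)
    attended = self_attention(fused, w_q, w_k, w_v)
    return attended[: human_tokens.shape[0]]


rng = np.random.default_rng(1)
dim = 4
human = rng.standard_normal((5, dim))     # 5 flattened human-feature tokens
garment = rng.standard_normal((7, dim))   # 7 flattened garment-feature tokens
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) for _ in range(3))

fused_out = outfitting_fusion(human, garment, w_q, w_k, w_v)
```

Because every human token can attend to every garment token, the layer can align garment details with body regions without an explicit warping step, which is consistent with the attention maps the Figure 3 caption describes.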