ODPG: Outfitting Diffusion with Pose Guided Condition

Seohyun Lee, Jintae Park, Sanghyeok Park

TL;DR

ODPG tackles dynamic virtual try-on by leveraging a latent diffusion model conditioned on three inputs: target garment, target pose, and source appearance. It introduces multi-scale feature extraction and a bias-augmented query attention mechanism to integrate garment, pose, and appearance within a UNet, enabling non-explicit garment synthesis without warping. The model achieves strong realism and texture fidelity on FashionTryOn and a subset of DeepFashion, while reducing data dependency through end-to-end training and classifier-free guidance. This approach simplifies the VTON pipeline and paves the way for video VTON and cross-domain applications in data-limited settings.
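The summary above names classifier-free guidance without spelling it out. As a minimal sketch, assuming the standard single-scale formulation (the source does not say how ODPG weights its three conditions, so this is not necessarily the authors' exact scheme), the guided noise prediction extrapolates from the unconditional prediction toward the conditional one. The function name `classifier_free_guidance` and the parameter `guidance_scale` are illustrative, and plain Python lists stand in for the latent tensors.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and conditional noise predictions element-wise.

    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction,
    and larger values push further in the conditional direction.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("noise predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy two-element "noise predictions" (hypothetical values, not model output).
guided = classifier_free_guidance([0.0, 1.0], [1.0, 3.0], guidance_scale=2.0)
print(guided)  # [2.0, 5.0]
```

In a real sampler the two predictions would come from running the denoising UNet once with the conditions dropped and once with them present.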

Abstract

Virtual Try-On (VTON) technology lets users visualize how clothes would look on them without physically trying them on, and it has gained traction with the rise of digitalization and online shopping. Traditional VTON methods, which often rely on Generative Adversarial Networks (GANs) or diffusion models, struggle to achieve high realism and to handle dynamic poses. This paper introduces Outfitting Diffusion with Pose Guided Condition (ODPG), a novel approach that leverages a latent diffusion model with multiple conditioning inputs during the denoising process. By transforming garment, pose, and appearance images into latent features and integrating these features in a UNet-based denoising model, ODPG achieves non-explicit synthesis of garments on dynamically posed human images. Our experiments on the FashionTryOn dataset and a subset of the DeepFashion dataset demonstrate that ODPG generates realistic VTON images with fine-grained texture details across various poses, using an end-to-end architecture that needs no explicit garment warping. Future work will focus on generating VTON outputs in video format and on applying our attention mechanism, as detailed in the Method section, to other domains with limited data.

Paper Structure

This paper contains 22 sections, 8 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overall pipeline of ODPG. The source image features are extracted at three different scales using a Swin Transformer. Similarly, the garment image passes through a Swin Transformer, with features extracted at four scales. The fourth-scale feature is passed through a transformer decoder with learnable queries, which outputs general garment information that conditions the diffusion UNet at each downsampling and upsampling block. Multi-scale features from both the source and garment images are combined via attention mechanisms, with fine details added as conditions in the upsampling blocks. Additionally, pose information, processed through a ResNet, provides coarse guidance, conditioning the downsampling blocks.
  • Figure 2: Qualitative results showcasing the performance of our model. The images illustrate the ability of our model to retrieve relevant items across different poses and domains.
  • Figure 3: Comparison of different models: IMAGDressing, IDM-VTON, OOTD, and our ODPG model.
  • Figure 4: The results of five experiments demonstrating the impact of different input variations to the appearance encoder. The experiments include (i) appearance only, (ii) 2x appearance, (iii) garment only, (iv) 2x garment, and (v) appearance + garment. The figure highlights how additional bias in queries improves the focus of the features, with the best results achieved by balancing both appearance and garment inputs.
  • Figure 5: Impact of gray masking on upper body segmentation, showing how masking removes garment details and contaminates appearance information, leading to distorted color and patterns in the output due to the cross-attention mechanism using biased queries from the source and garment images.
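
The captions for Figures 4 and 5 refer to cross-attention with biased queries, but the listing does not define the mechanism. The sketch below is one plausible reading, offered as an assumption rather than the authors' implementation: a bias vector (for example, one pooled from appearance and garment features) is added to every query before ordinary scaled dot-product attention. The function names, the additive form of the bias, and the single-head, list-based layout are all hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_query_cross_attention(queries, query_bias, keys, values):
    """Single-head cross-attention where query_bias is added to each query.

    queries: list of query vectors (length d each)
    query_bias: one vector of length d, assumed to carry appearance/garment cues
    keys, values: lists of key and value vectors from the conditioning features
    """
    dim = len(keys[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Assumed form of the "bias-augmented" query: plain element-wise addition.
        biased_q = [qi + bi for qi, bi in zip(q, query_bias)]
        scores = [
            sum(a * b for a, b in zip(biased_q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return outputs


# Toy example: two keys/values; a bias along the first axis shifts attention
# toward the first key.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
neutral = biased_query_cross_attention([[0.0, 0.0]], [0.0, 0.0], keys, values)
steered = biased_query_cross_attention([[0.0, 0.0]], [10.0, 0.0], keys, values)
print(neutral, steered)
```

With a zero bias the query attends to both keys equally; with a bias aligned to the first key, nearly all attention weight moves there, which is the kind of focusing effect the Figure 4 caption attributes to the added query bias.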