IMAGDressing-v1: Customizable Virtual Dressing

Fei Shen, Xin Jiang, Xin He, Hu Ye, Cong Wang, Xiaoyu Du, Zechao Li, Jinhui Tang

TL;DR

The paper defines virtual dressing (VD) to enable comprehensive, editable garment displays for merchants, addressing limitations of consumer-focused virtual try-on (VTON). It introduces IMAGDressing-v1, a latent-diffusion framework that uses a trainable Garment UNet to encode semantic/texture garment features and a frozen denoising UNet fused via a Hybrid Attention module, enabling text-driven scene control and compatibility with plug-ins like ControlNet and IP-Adapter. To support data-hungry learning and evaluation, the authors release IGPair, a large-scale garment-pair dataset with 324,857 image pairs, high-resolution imagery, pose/dense pose/segmentation data, and captions, along with the comprehensive CAMI metrics (CAMI-U and CAMI-S) for garment-consistency assessment. Empirical results show state-of-the-art performance under controlled conditions, with ablations validating the contributions of the image encoder branch and hybrid attention, and demonstrations of VTON extensions and data-driven applications using IGPair.

Abstract

Latest advances have achieved realistic virtual try-on (VTON) through localized garment inpainting using latent diffusion models, significantly enhancing consumers' online shopping experience. However, existing VTON technologies neglect the need for merchants to showcase garments comprehensively, including flexible control over garments, optional faces, poses, and scenes. To address this issue, we define a virtual dressing (VD) task focused on generating freely editable human images with fixed garments and optional conditions. Meanwhile, we design a comprehensive affinity metric index (CAMI) to evaluate the consistency between generated images and reference garments. Then, we propose IMAGDressing-v1, which incorporates a garment UNet that captures semantic features from CLIP and texture features from VAE. We present a hybrid attention module, including a frozen self-attention and a trainable cross-attention, to integrate garment features from the garment UNet into a frozen denoising UNet, ensuring users can control different scenes through text. IMAGDressing-v1 can be combined with other extension plugins, such as ControlNet and IP-Adapter, to enhance the diversity and controllability of generated images. Furthermore, to address the lack of data, we release the interactive garment pairing (IGPair) dataset, containing over 300,000 pairs of clothing and dressed images, and establish a standard pipeline for data assembly. Extensive experiments demonstrate that our IMAGDressing-v1 achieves state-of-the-art human image synthesis performance under various controlled conditions. The code and model will be available at https://github.com/muzishen/IMAGDressing.
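
The hybrid attention described in the abstract can be illustrated with a small NumPy sketch. The abstract states only that a frozen self-attention and a trainable cross-attention inject garment features into the frozen denoising UNet. The additive combination, the shared query, the scale `lam`, and all function and variable names below are assumptions chosen for illustration, loosely modelled on decoupled cross-attention; this is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention for a single head, no batch dimension.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def hybrid_attention(x, garment, w_self, w_cross, lam=1.0):
    """Hypothetical hybrid attention.

    x:       (n, d) tokens from the denoising UNet
    garment: (m, d) tokens from the garment UNet
    w_self:  (Wq, Wk, Wv) projections treated as frozen
    w_cross: (Wk_g, Wv_g) projections treated as trainable
    lam:     assumed scale on the garment branch
    """
    wq, wk, wv = w_self
    wk_g, wv_g = w_cross
    q = x @ wq                                   # query shared by both branches (assumption)
    self_out = attention(q, x @ wk, x @ wv)      # frozen self-attention
    cross_out = attention(q, garment @ wk_g, garment @ wv_g)  # garment cross-attention
    return self_out + lam * cross_out            # additive fusion (assumption)

rng = np.random.default_rng(0)
n, m, d = 4, 6, 8
x = rng.normal(size=(n, d))
garment = rng.normal(size=(m, d))
w_self = tuple(rng.normal(size=(d, d)) for _ in range(3))
w_cross = tuple(rng.normal(size=(d, d)) for _ in range(2))

out = hybrid_attention(x, garment, w_self, w_cross, lam=0.8)
baseline = hybrid_attention(x, garment, w_self, w_cross, lam=0.0)
```

With `lam=0.0` the garment branch is switched off and the output reduces to plain self-attention over the denoising tokens, which is one way to see why keeping the self-attention weights frozen would leave the base model's behavior intact in this sketch.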

Paper Structure (21 sections, 6 equations, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Differences between virtual try-on and virtual dressing tasks in conditions and applicable scenarios.
  • Figure 2: Sample pairs from the IGPair dataset, including pose keypoints, dense poses, and human body segmentation masks. For more samples, refer to the supplementary material.
  • Figure 3: Illustration of the proposed IMAGDressing-v1 framework. It mainly consists of a trainable garment UNet and a frozen denoising UNet. The former extracts fine-grained garment features, while the latter balances these features with text prompts. IMAGDressing-v1 is compatible with other community modules, such as ControlNet and IP-Adapter.
  • Figure 4: Qualitative comparison with other SOTA methods under both unspecific and specific conditions, including BLIP-Diffusion, Versatile Diffusion, IP-Adapter, and MagicClothing.
  • Figure 5: Ablation study of each component.
  • ...and 7 more figures