ITVTON: Virtual Try-On Diffusion Transformer Based on Integrated Image and Text

Haifeng Ni, Ming Xu

TL;DR

ITVTON tackles the challenge of realistic and efficient virtual try-on by using a single Diffusion Transformer as the generator, avoiding extra image encoders through width-wise concatenation of garment and person inputs and integrated image-text prompts. The method restricts training to the attention parameters of a Single-DiT block, achieving a favorable balance between fidelity and computational cost. Experiments on VITON-HD and IGPair show state-of-the-art performance across qualitative and quantitative metrics, with robustness in challenging scenes. The approach yields a compact, parameter-efficient model with 1,076.2M trainable parameters and demonstrates strong potential for real-world online fashion applications.

Abstract

Virtual try-on, which aims to seamlessly fit garments onto person images, has recently seen significant progress with diffusion-based models. However, existing methods commonly resort to duplicated backbones or additional image encoders to extract garment features, which increases computational overhead and network complexity. In this paper, we propose ITVTON, an efficient framework that leverages the Diffusion Transformer (DiT) as its single generator to improve image fidelity. By concatenating garment and person images along the width dimension and incorporating textual descriptions from both, ITVTON effectively captures garment-person interactions while preserving realism. To further reduce computational cost, we restrict training to the attention parameters within a single Diffusion Transformer (Single-DiT) block. Extensive experiments demonstrate that ITVTON surpasses baseline methods both qualitatively and quantitatively, setting a new standard for virtual try-on. Moreover, experiments on 10,257 image pairs from IGPair confirm its robustness in real-world scenarios.
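
The width-wise concatenation described in the abstract can be illustrated with a minimal sketch. It assumes images are plain nested lists shaped [height][width][channels] and places the garment on the left; the left/right order and the helper names `concat_width` and `make_image` are hypothetical and are not taken from the paper.

```python
# Minimal sketch: images are nested lists shaped [height][width][channels].
# All names here are hypothetical; this is not the authors' code.

def concat_width(garment, person):
    """Join two images side by side (garment left, person right: an assumed order)."""
    if len(garment) != len(person):
        raise ValueError("images must share the same height")
    # Row-wise list concatenation extends the width axis only.
    return [g_row + p_row for g_row, p_row in zip(garment, person)]

def make_image(height, width, value):
    """Build a constant-valued 3-channel image for the demo."""
    return [[[value] * 3 for _ in range(width)] for _ in range(height)]

garment = make_image(4, 3, 0.2)
person = make_image(4, 3, 0.8)
combined = concat_width(garment, person)

# Height and channel count are unchanged; width is the sum of both inputs.
print(len(combined), len(combined[0]), len(combined[0][0]))
```

Because both images end up in one input, a single generator can attend across the garment and person regions, which is the stated reason no separate garment encoder is needed.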

Paper Structure

This paper contains 15 sections, 7 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Virtual try-on images generated by ITVTON on the VITON-HD [Choi_2021_CVPR] dataset (row 1) and the filtered IGPair [shen2024imagdressingv1customizablevirtualdressing] dataset (row 2). Zooming in is recommended for visual evaluation.
  • Figure 2: Overview of ITVTON. Our approach achieves high-quality virtual try-on by concatenating the garment image with the target person image along the width dimension and incorporating integrated image-text representations (as illustrated in the figure). Only the attention parameters in the Single-DiT module remain learnable during training, ensuring a streamlined and efficient try-on network.
  • Figure 3: Qualitative comparison on the VITON-HD [Choi_2021_CVPR] dataset. ITVTON exhibits significant advantages in processing complex patterns and text. We recommend zooming in for a detailed inspection.
  • Figure 4: Qualitative comparison in field scenarios demonstrates that our method generates more natural try-on effects, even in complex scenes and with varied postures.
  • Figure 5: Comparison of different guidance scales used during training. With the inference guidance scale set to 30, models trained with a guidance scale of 2 or 30 perform poorly when generating garments with text or complex patterns.
  • ...and 1 more figure
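
The Figure 2 caption states that only the attention parameters in the Single-DiT module remain learnable. A minimal sketch of that selection follows, assuming parameters can be identified by name. The parameter names, the `single_blocks.` / `.attn.` naming scheme, and the sizes are made up for illustration and do not come from the paper or any specific library.

```python
# Hypothetical parameter inventory: name -> number of scalar weights.
# Names and sizes are invented for illustration only.
params = {
    "double_blocks.0.attn.qkv.weight": 300,
    "double_blocks.0.mlp.fc1.weight": 400,
    "single_blocks.0.attn.to_q.weight": 100,
    "single_blocks.0.attn.to_k.weight": 100,
    "single_blocks.0.attn.to_v.weight": 100,
    "single_blocks.0.mlp.fc1.weight": 400,
    "single_blocks.0.norm.weight": 10,
}

def is_trainable(name):
    """Keep only attention weights that live inside Single-DiT blocks (assumed naming)."""
    return name.startswith("single_blocks.") and ".attn." in name

# Split the inventory into the learnable subset and everything else.
trainable = {n: c for n, c in params.items() if is_trainable(n)}
frozen = {n: c for n, c in params.items() if n not in trainable}

trainable_count = sum(trainable.values())
total_count = sum(params.values())
print(f"trainable: {trainable_count} / {total_count}")
```

In PyTorch, the same filter would typically be applied by setting `requires_grad` to `False` on every parameter outside the selected subset, so that the optimizer updates only the attention weights.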