ITVTON: Virtual Try-On Diffusion Transformer Based on Integrated Image and Text
Haifeng Ni, Ming Xu
TL;DR
ITVTON targets realistic and efficient virtual try-on by using a single Diffusion Transformer (DiT) as the generator. It avoids extra image encoders by concatenating the garment and person images along the width dimension and by conditioning on integrated image-text prompts. Training is restricted to the attention parameters of a Single-DiT block, which balances fidelity against computational cost. Experiments on VITON-HD and IGPair show state-of-the-art results in both qualitative and quantitative comparisons, with robustness in challenging scenes. The resulting model has 1,076.2M trainable parameters and shows strong potential for real-world online fashion applications.
Abstract
Virtual try-on, which aims to seamlessly fit garments onto person images, has recently seen significant progress with diffusion-based models. However, existing methods commonly resort to duplicated backbones or additional image encoders to extract garment features, which increases computational overhead and network complexity. In this paper, we propose ITVTON, an efficient framework that uses the Diffusion Transformer (DiT) as its single generator to improve image fidelity. By concatenating the garment and person images along the width dimension and incorporating textual descriptions of both, ITVTON captures garment-person interactions while preserving realism. To further reduce computational cost, we restrict training to the attention parameters within a single Diffusion Transformer (Single-DiT) block. Extensive experiments demonstrate that ITVTON surpasses baseline methods both qualitatively and quantitatively. Moreover, experiments on 10,257 image pairs from IGPair confirm its robustness in real-world scenarios.
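The two mechanisms named in the abstract, width-wise concatenation and attention-only training, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `concat_width` and `select_trainable`, the garment-then-person ordering, the `H x W x C` array layout, and the parameter names in the usage example are all assumptions made for illustration.

```python
import numpy as np


def concat_width(garment, person):
    """Place two H x W x C images side by side along the width axis.

    Assumption: garment on the left, person on the right. The abstract
    only states that the images are concatenated along the width.
    """
    if garment.shape[0] != person.shape[0] or garment.shape[2] != person.shape[2]:
        raise ValueError("height and channel count must match")
    return np.concatenate([garment, person], axis=1)


def select_trainable(param_names, keyword=".attn."):
    """Return the parameter names that would remain trainable.

    Assumption: attention parameters can be identified by a substring
    in their name; all other parameters would be frozen.
    """
    return [name for name in param_names if keyword in name]


if __name__ == "__main__":
    garment = np.zeros((4, 3, 3))  # toy 4 x 3 image, 3 channels
    person = np.ones((4, 5, 3))    # toy 4 x 5 image, 3 channels
    print(concat_width(garment, person).shape)

    # Hypothetical parameter names, not taken from any real checkpoint.
    names = [
        "single_blocks.0.attn.to_q.weight",
        "single_blocks.0.attn.to_k.weight",
        "single_blocks.0.proj_mlp.weight",
        "single_blocks.0.norm.linear.weight",
    ]
    print(select_trainable(names))
```

In a training framework, the selected names would typically be the only parameters handed to the optimizer, with the rest frozen. Because both images pass through the same generator as a single wider input, no separate garment encoder is needed in this setup.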
