VITON-DiT: Learning In-the-Wild Video Try-On from Human Dance Videos via Diffusion Transformers

Jun Zheng, Fuwei Zhao, Youjiang Xu, Xin Dong, Xiaodan Liang

TL;DR

This work tackles video virtual try-on in unconstrained real-world settings with VITON-DiT, a DiT-based framework that combines a Spatio-Temporal Denoising DiT, a Garment Extractor, and an Identity-Preservation ControlNet. An attention fusion mechanism injects garment features into the denoising DiT and the ControlNet, so that clothing details and the person's identity are preserved across long video sequences. The model is trained on unpaired dance videos with a multi-stage self-supervised regime, and it uses random condition selection during training together with Interpolated Auto-Regressive (IAR) inference to generate temporally consistent videos tens of seconds long. It achieves competitive image quality and better video fidelity than GAN-based and UNet-based diffusion baselines, and it keeps garment textures faithful under dynamic human motion. The paper also introduces a new real-world benchmark and shows that the model scales with data and is robust to challenging poses and backgrounds, suggesting a practical path toward scalable, in-the-wild video try-on systems.

Abstract

Video try-on stands as a promising area for its tremendous real-world potential. Prior works are limited to transferring product clothing images onto person videos with simple poses and backgrounds, while underperforming on casually captured videos. Recently, Sora revealed the scalability of Diffusion Transformer (DiT) in generating lifelike videos featuring real-world scenarios. Inspired by this, we explore and propose the first DiT-based video try-on framework for practical in-the-wild applications, named VITON-DiT. Specifically, VITON-DiT consists of a garment extractor, a Spatio-Temporal denoising DiT, and an identity preservation ControlNet. To faithfully recover the clothing details, the extracted garment features are fused with the self-attention outputs of the denoising DiT and the ControlNet. We also introduce novel random selection strategies during training and an Interpolated Auto-Regressive (IAR) technique at inference to facilitate long video generation. Unlike existing attempts that require the laborious and restrictive construction of a paired training dataset, severely limiting their scalability, VITON-DiT alleviates this by relying solely on unpaired human dance videos and a carefully designed multi-stage training strategy. Furthermore, we curate a challenging benchmark dataset to evaluate the performance of casual video try-on. Extensive experiments demonstrate the superiority of VITON-DiT in generating spatio-temporally consistent try-on results for in-the-wild videos with complicated human poses.
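
As a rough illustration of the fusion step mentioned in the abstract, the sketch below adds a cross-attention term (person queries attending to garment tokens) to the self-attention output of the person tokens. This is only one plausible reading of fusing garment features "with the self-attention outputs", not the authors' implementation: the function names, the absence of learned projections, and the toy dimensions are all assumptions, and plain Python lists are used only to keep the example dependency-free.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def fuse_attention(person_tokens, garment_tokens):
    """Hypothetical additive fusion: self-attention plus garment cross-attention."""
    # Person tokens attend to themselves (the usual self-attention branch).
    self_out = attention(person_tokens, person_tokens, person_tokens)
    # Person tokens query the garment tokens (assumed cross-attention branch).
    cross_out = attention(person_tokens, garment_tokens, garment_tokens)
    # The two branches are summed element-wise.
    return [[s + c for s, c in zip(s_row, c_row)]
            for s_row, c_row in zip(self_out, cross_out)]

person = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 person tokens, dim 2 (toy data)
garment = [[0.5, 0.5], [1.0, -1.0]]             # 2 garment tokens, dim 2 (toy data)
fused = fuse_attention(person, garment)
```

The fused output keeps the shape of the person tokens, so under this reading it could replace a plain self-attention output without changing the rest of a block. A real model would apply learned query, key, and value projections and multiple heads, which are omitted here.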

Paper Structure

This paper contains 20 sections, 7 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Video try-on results of the proposed VITON-DiT. Our model generalizes across diverse types of clothing, including non-product garments that may have flaws. It also handles complex body movements such as dancing against real-world backgrounds.
  • Figure 2: Overview of the proposed VITON-DiT. (a) The architecture contains three components with the following tasks. (1) Denoising DiT: generating a latent representation of the video content via a chain of Spatio-Temporal (ST-) DiT blocks. (2) ID ControlNet: producing feature residuals for the Denoising DiT to preserve the reference person's identity, pose, and background. (3) Garment Extractor: obtaining garment features and delivering them into the Denoising DiT and the ControlNet via attention fusion, thus recovering detailed clothing textures in the generated try-on video. (b) Illustration of Attention Fusion: integrating person denoising features and extracted garment features using additive attention. This operation is used in both the Denoising DiT and the ID ControlNet.
  • Figure 3: Strategies for long video generation. (a) Random agnostic condition swap: randomly replacing agnostic images and inpainting masks with the corresponding ground-truth frames and all-zero masks. (b) IAR inference: generating key-frames within each divided sequence, followed by an AR inference that fills in the missing frames. Note that random swap training is a prerequisite for IAR inference.
  • Figure 4: Qualitative comparison with baselines. Our VITON-DiT outperforms the baselines in consistently preserving garment shape and color, as well as in keeping clothing-person alignment stable across varying camera distances.
  • Figure 5: Ablation study on the quantity of data. As the quality and quantity of data increase, the model's visual performance gradually improves.
  • ...and 1 more figure
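
The two-phase IAR inference described in the Figure 3 caption can be sketched as a simple frame schedule: key-frames are generated first, then the gaps between neighbouring key-frames are filled in. The sketch below is a hedged reading of the caption only. The function name `plan_iar`, the fixed `stride`, and the choice to condition each fill step on both neighbouring key-frames are assumptions, not details confirmed by the source.

```python
def plan_iar(num_frames, stride):
    """Return (key_frames, fill_steps) for a hypothetical IAR schedule.

    key_frames: indices generated first, every `stride` frames plus the last.
    fill_steps: (left_key, right_key, missing) triples; the missing frames
    would be generated while conditioning on the two already-known key-frames.
    """
    if num_frames < 1 or stride < 1:
        raise ValueError("num_frames and stride must be positive")
    key_frames = list(range(0, num_frames, stride))
    # Make sure the final frame is always a key-frame so every gap is bounded.
    if key_frames[-1] != num_frames - 1:
        key_frames.append(num_frames - 1)
    fill_steps = []
    for left, right in zip(key_frames, key_frames[1:]):
        missing = list(range(left + 1, right))
        if missing:  # adjacent key-frames leave nothing to fill
            fill_steps.append((left, right, missing))
    return key_frames, fill_steps

# Toy example: a 10-frame clip with a key-frame every 4 frames (assumed values).
keys, steps = plan_iar(num_frames=10, stride=4)
```

In this reading, the frames already generated in the first phase play the role of the ground-truth frames with all-zero masks seen during random swap training, which is why the caption calls that training strategy a prerequisite.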