WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models

Zijian He, Peixin Chen, Guangrun Wang, Guanbin Li, Philip H. S. Torr, Liang Lin

TL;DR

WildVidFit tackles video virtual try-on in unconstrained environments by reframing the task as image-based video generation conditioned on garment and human motion. It introduces a one-stage image try-on network built on a diffusion model, coupled with a diffusion-guidance module that enforces temporal coherence through priors from VideoMAE and DINO-V2, and enhanced by an EMASC autoencoder and fully cross-frame attention. Training is conducted on still images, avoiding the need for heavy video-specific temporal modules, yet inference-time guidance yields fluid, coherent videos across datasets such as VITON-HD, DressCode, VVT, and TikTok. The approach achieves state-of-the-art image virtual try-on metrics and competitive video results, demonstrating robust garment transfer and temporal stability in wild scenarios with practical implications for e-commerce and social media applications.

Abstract

Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos. Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions, limiting their effectiveness in video try-on applications. Moreover, video-based models require extensive, high-quality data and substantial computational resources. To tackle these issues, we reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion. Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach. This model, conditioned on specific garments and individuals, is trained on still images rather than videos. It leverages diffusion guidance from pre-trained models, including a video masked autoencoder for improving segment smoothness and a self-supervised model for aligning the features of adjacent frames in the latent space. This integration markedly boosts the model's ability to maintain temporal coherence, enabling more effective video try-on within an image-based framework. Our experiments on the VITON-HD and DressCode datasets, along with tests on the VVT and TikTok datasets, demonstrate WildVidFit's capability to generate fluid and coherent videos. The project page is at wildvidfit-project.github.io.

Paper Structure (18 sections, 8 equations, 12 figures, 5 tables)


Figures (12)

  • Figure 1: Examples of our virtual try-on results on real-life TikTok videos.
  • Figure 2: Overview of our WildVidFit framework. Our method contains two modules: a one-stage image try-on network and a guidance module. At timestep $t$, we crop the garment area and decode the latent $Z_t$ into the sequence $\mathbf{I_t}$. The similarity loss $L_{SIM}$ is calculated between adjacent frames $I^{j+1}_t$ and $I^j_t$ using spherical distance. Additionally, we randomly mask the sequence $\mathbf{I_t}$ into $\hat{\mathbf{I}}_t$, which is then fed into VideoMAE for reconstruction. $L_{MAE}$ represents the distance between the sequences $\mathbf{I_t}$ and $\hat{\mathbf{I}}_t$. We assume that a lower reconstruction loss corresponds to a smoother sequence. $L_{SIM}$ and $L_{MAE}$ together constitute the temporal loss, which controls the sampling process from $Z_t$ to $Z_{t-1}$.
  • Figure 3: Overview of the proposed one-stage image try-on network. First, we extract the person representation and the garment representation during preprocessing. The person representation includes the cloth-agnostic image $A$ and the human pose $P$, while the garment representation includes the garment image $G$ and the edge map $E_g$. The two representations then condition the diffusion model through hierarchical fusion in the UNet decoder and through cross-attention, respectively.
  • Figure 4: Qualitative comparison on the VITON-HD dataset. Zoom in for best view.
  • Figure 5: Cross-dataset video try-on results, given a reference video from the TikTok dataset and a garment item from the DressCode (1st row) and VITON-HD (2nd row) datasets. Zoom in for best view.
  • ...and 7 more figures
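
The temporal loss described in the Figure 2 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes the squared great-circle form of spherical distance that is common in guided-diffusion code, uses element-wise random masking in place of VideoMAE's patch masking, and measures reconstruction error on the masked entries, which is standard masked-autoencoder practice rather than something the caption states. The names `spherical_distance`, `similarity_loss`, `masked_reconstruction_loss`, `temporal_loss`, the `reconstruct` callable, and the weights `w_sim` and `w_mae` are all hypothetical.

```python
import numpy as np


def spherical_distance(x, y, eps=1e-8):
    """Squared great-circle distance between L2-normalised vectors (assumed form)."""
    x = x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)
    y = y / (np.linalg.norm(y, axis=-1, keepdims=True) + eps)
    chord = np.linalg.norm(x - y, axis=-1)    # chord length on the unit sphere
    half = np.clip(chord / 2.0, 0.0, 1.0)     # keep arcsin inside its domain
    return 2.0 * np.arcsin(half) ** 2


def similarity_loss(features):
    """Mean spherical distance between adjacent frames; features has shape (T, D)."""
    return float(np.mean(spherical_distance(features[1:], features[:-1])))


def masked_reconstruction_loss(frames, reconstruct, mask_ratio=0.5, rng=None):
    """Mean squared error on randomly hidden entries.

    `reconstruct` is a hypothetical stand-in for a pre-trained video masked
    autoencoder: it takes the masked sequence and returns a full sequence.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(frames.shape) < mask_ratio   # True marks a hidden entry
    if not mask.any():
        return 0.0
    masked = np.where(mask, 0.0, frames)
    recon = reconstruct(masked)
    return float(np.mean((recon[mask] - frames[mask]) ** 2))


def temporal_loss(features, frames, reconstruct, w_sim=1.0, w_mae=1.0, rng=None):
    """Weighted sum of the two terms; the weights are illustrative assumptions."""
    l_sim = similarity_loss(features)
    l_mae = masked_reconstruction_loss(frames, reconstruct, rng=rng)
    return w_sim * l_sim + w_mae * l_mae
```

In a guided sampler, the gradient of such a loss with respect to the latent would be used to adjust each denoising step. That requires an automatic-differentiation framework and is outside the scope of this sketch.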