Fashion-VDM: Video Diffusion Model for Virtual Try-On

Johanna Karras, Yingwei Li, Nan Liu, Luyang Zhu, Innfarn Yoo, Andreas Lugmayr, Chris Lee, Ira Kemelmacher-Shlizerman

TL;DR

Fashion-VDM is a video diffusion model (VDM) for virtual try-on: given a garment image and a person video, it aims to generate a high-quality video of the person wearing the garment while preserving the person's identity and motion.

Abstract

We present Fashion-VDM, a video diffusion model (VDM) for generating virtual try-on videos. Given an input garment image and person video, our method aims to generate a high-quality try-on video of the person wearing the given garment, while preserving the person's identity and motion. Image-based virtual try-on has shown impressive results; however, existing video virtual try-on (VVT) methods are still lacking garment details and temporal consistency. To address these issues, we propose a diffusion-based architecture for video virtual try-on, split classifier-free guidance for increased control over the conditioning inputs, and a progressive temporal training strategy for single-pass 64-frame, 512px video generation. We also demonstrate the effectiveness of joint image-video training for video try-on, especially when video data is limited. Our qualitative and quantitative experiments show that our approach sets the new state-of-the-art for video virtual try-on. For additional results, visit our project page: https://johannakarras.github.io/Fashion-VDM.

Paper Structure

This paper contains 37 sections, 2 equations, 12 figures, 4 tables, 1 algorithm.

Figures (12)

  • Figure 1: Fashion-VDM Architecture. Given a noisy video $z_t$ at diffusion timestep $t$, a forward pass of Fashion-VDM computes a single denoising step to get the denoised video $z'_{t-1}$. Noisy video $z_t$ is preprocessed into person poses $J_p$ and clothing-agnostic frames $I_a$, while the garment image $I_g$ is preprocessed into the garment segmentation $S_g$ and garment poses $J_g$ (Section \ref{ssec:input-preprocessing}). The architecture follows mixmatch, except the main UNet contains 3D-Conv and temporal attention blocks to maintain temporal consistency. Additionally, we inject temporal down/upsampling blocks during 64-frame temporal training. Noisy video $z_t$ is encoded by the main UNet and the conditioning signals, $S_g$ and $I_a$, are encoded by separate UNet encoders. In the 8 DiT blocks at the lowest resolution of the UNet, the garment conditioning features are cross-attended with the noisy video features, and the spatially-aligned clothing-agnostic features $z_a$ and noisy video features are directly concatenated. $J_g$ and $J_p$ are encoded by single linear layers, then concatenated to the noisy features in all UNet 2D spatial layers.
  • Figure 2: Split-CFG Ablation. We compare different split-CFG weights, where $(w_{\emptyset}, w_\text{p}, w_\text{g}, w_\text{full})$ correspond to the unconditional guidance, person-only guidance, person and cloth guidance, and full guidance terms, respectively.
  • Figure 3: Joint Training Ablation. Joint image and video training improves the realism of occluded views.
  • Figure 4: Garment Fidelity Ablations. We compare our full model with ablated versions without split-CFG and without joint image-video training in terms of garment fidelity. Both split-CFG and joint image-video training improve fine-grain garment details (top row) and novel view generation (bottom row).
  • Figure 5: Temporal Smoothness Ablations. We compare video frames generated by our ablated model without temporal blocks (top row) and without progressive training (middle row) to our full model (bottom row). Both ablated versions exhibit large frame-to-frame inconsistencies and artifacts.
  • ...and 7 more figures
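
The Figure 2 caption names four split-CFG weights but this excerpt does not give the formula that combines them. The sketch below shows one plausible reading, assuming a telescoping form in which each weight scales the change in the noise prediction as one more group of conditioning signals is added. The class `SplitCFGWeights`, the function `split_cfg`, and the argument names are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitCFGWeights:
    """Hypothetical container for the four weights named in Figure 2."""
    w_null: float  # weight on the unconditional prediction
    w_p: float     # weight on the person-only guidance term
    w_g: float     # weight on the person-and-cloth guidance term
    w_full: float  # weight on the full-conditioning guidance term


def split_cfg(eps_null, eps_p, eps_pg, eps_full, w):
    """Combine four noise predictions into one guided prediction.

    Assumed form (not confirmed by the source):
        w_null * eps_null
      + w_p    * (eps_p    - eps_null)
      + w_g    * (eps_pg   - eps_p)
      + w_full * (eps_full - eps_pg)

    Inputs may be floats or arrays that support elementwise arithmetic.
    """
    return (
        w.w_null * eps_null
        + w.w_p * (eps_p - eps_null)
        + w.w_g * (eps_pg - eps_p)
        + w.w_full * (eps_full - eps_pg)
    )


# Toy scalar stand-ins for the four model outputs.
plain = split_cfg(1.0, 2.0, 4.0, 8.0, SplitCFGWeights(1.0, 1.0, 1.0, 1.0))
boosted = split_cfg(1.0, 2.0, 4.0, 8.0, SplitCFGWeights(1.0, 1.0, 3.0, 1.0))
```

Under this assumed form, setting every weight to 1 makes the terms cancel pairwise and returns the fully conditioned prediction unchanged, while raising a single weight amplifies only the contribution of the corresponding conditioning group. That is one way to read "increased control over the conditioning inputs" in the abstract.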