SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models

Hung Nguyen, Quang Qui-Vinh Nguyen, Khoi Nguyen, Rang Nguyen

TL;DR

SwiftTry tackles temporally coherent video virtual try-on by extending diffusion-based image inpainting with temporal attention, which maintains frame-to-frame consistency. It introduces ShiftCaching, which reduces redundant computation on long videos by processing non-overlapping chunks whose boundaries shift by $\Delta$ at each denoising step, and by computing some chunks only partially from cached features using a Masked Temporal Attention scheme. The new TikTokDress dataset provides diverse backgrounds and complex motions, with detailed garment masks and pose annotations to support training and evaluation. Empirical results on VVT and TikTokDress show better scores on VFID, SSIM, and LPIPS and higher inference speed (up to $2.27$ FPS, with approximately $1.5\times$ speedup from ShiftCaching) than prior methods, which matters for real-world fashion applications.
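
The shifted, non-overlapping chunking described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `shifted_chunks`, the wrap-around of the offset modulo the chunk size, and the shorter first and last chunks are all assumptions made for the example.

```python
def shifted_chunks(num_frames, chunk_size, shift, step):
    """Partition frame indices into non-overlapping chunks for one denoising step.

    Assumption (not from the paper): the chunk boundary moves by `shift`
    frames per step and wraps around modulo `chunk_size`, so the first and
    last chunks may be shorter than `chunk_size`.
    """
    offset = (step * shift) % chunk_size
    first_boundary = offset if offset > 0 else chunk_size
    bounds = [0]
    bounds.extend(range(first_boundary, num_frames, chunk_size))
    bounds.append(num_frames)
    # Drop empty chunks (possible only when num_frames is 0).
    return [list(range(a, b)) for a, b in zip(bounds, bounds[1:]) if a < b]


if __name__ == "__main__":
    # Hypothetical sizes: 16 frames, chunk size 8, shift 4.
    for step in range(3):
        print(step, [(c[0], c[-1]) for c in shifted_chunks(16, 8, 4, step)])
```

Because the boundaries move between steps, frames that sit at a chunk edge in one step sit in a chunk interior at the next, while every frame is still processed exactly once per step.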

Abstract

Given an input video of a person and a new garment, the objective of this paper is to synthesize a new video where the person is wearing the specified garment while maintaining spatiotemporal consistency. Although significant advances have been made in image-based virtual try-on, extending these successes to video often leads to frame-to-frame inconsistencies. Some approaches have attempted to address this by increasing the overlap of frames across multiple video chunks, but this comes at a steep computational cost due to the repeated processing of the same frames, especially for long video sequences. To tackle these challenges, we reconceptualize video virtual try-on as a conditional video inpainting task, with garments serving as input conditions. Specifically, our approach enhances image diffusion models by incorporating temporal attention layers to improve temporal coherence. To reduce computational overhead, we propose ShiftCaching, a novel technique that maintains temporal consistency while minimizing redundant computations. Furthermore, we introduce the TikTokDress dataset, a new video try-on dataset featuring more complex backgrounds, challenging movements, and higher resolution compared to existing public datasets. Extensive experiments demonstrate that our approach outperforms current baselines, particularly in terms of video consistency and inference speed. The project page is available at https://swift-try.github.io/.
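
To make the cost of overlapping chunks concrete, the sketch below counts how many frames pass through the network per denoising step for a sliding-window scheme. The function `frames_processed`, the assumption that every chunk is processed at full length, and the example sizes are illustrative and are not taken from the paper.

```python
import math


def frames_processed(num_frames, chunk_size, overlap):
    """Count frames fed to the network per denoising step with sliding chunks.

    Assumption: consecutive chunks share `overlap` frames and every chunk is
    processed at full `chunk_size` (the last chunk is aligned or padded).
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if num_frames <= chunk_size:
        return num_frames  # a single chunk covers the whole video
    stride = chunk_size - overlap
    num_chunks = math.ceil((num_frames - overlap) / stride)
    return num_chunks * chunk_size


if __name__ == "__main__":
    # Hypothetical 64-frame video with chunk size 16.
    for overlap in (0, 4, 8):
        total = frames_processed(64, 16, overlap)
        print(f"overlap={overlap}: {total} frames, {total / 64:.2f}x the video length")
```

With zero overlap each frame is processed once per step; the work grows as the overlap approaches the chunk size, which is the redundancy that non-overlapping shifted chunks avoid.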

Paper Structure

This paper contains 17 sections, 14 figures, 7 tables, 1 algorithm.

Figures (14)

  • Figure 1: Results of our SwiftTry compared to those of ViViD (Fang et al., 2024), a previous method for video try-on. Our method preserves garment texture detail and consistency while achieving over 60% faster runtime.
  • Figure 2: Overview of Stage 2 of our SwiftTry framework. (Stage 1 is similar, except that the input is a single image frame and the model does not include temporal attention layers.) Given an input video and a garment image, our method first extracts the masked video, the corresponding masks, and the pose sequence. The VAE Encoder maps the masked video into the latent space, and the resulting latents are concatenated with noise, masks, and pose features before being processed by the Main U-Net. To inpaint the garment during the denoising process, we use a Garment U-Net and a CLIP encoder to extract both low- and high-level garment features. These features are integrated into the Main U-Net through spatial and cross-attention mechanisms.
  • Figure 3: Illustration of fully and partially computed frames with a chunk size of $N = 8$ and a shift of $\Delta = 4$. In the partially computed chunk, one half uses cached features from $t+2$, while the other half uses features from $t+1$.
  • Figure 4: Comparison between a fully computed frame and a partially computed frame. The partially computed frame employs Masked Temporal Attention instead of standard Temporal Attention to resolve mismatches in cached features.
  • Figure 5: Example videos from the TikTokDress dataset highlighting diversity in skin tones, genders, camera angles, and clothing types.
  • ...and 9 more figures
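
One plausible reading of Figures 3 and 4 is sketched below: in a partially computed chunk, each frame is tagged with the denoising step its cached features came from, and temporal attention is masked so that a frame attends only to frames with the same tag. This masking rule, the function `masked_temporal_attention`, and the `source_ids` argument are assumptions made for illustration; the captions do not specify the exact mask.

```python
import math


def masked_temporal_attention(queries, keys, values, source_ids):
    """Scaled dot-product attention over the time axis with a block mask.

    Assumption (not confirmed by the captions): frame i may attend to frame j
    only if both use cached features from the same denoising step, that is,
    source_ids[i] == source_ids[j].
    """
    dim = len(queries[0])
    out_dim = len(values[0])
    outputs = []
    for i, query in enumerate(queries):
        # Frame i always matches its own tag, so `allowed` is never empty.
        allowed = [j for j in range(len(keys)) if source_ids[j] == source_ids[i]]
        scores = [
            sum(a * b for a, b in zip(query, keys[j])) / math.sqrt(dim)
            for j in allowed
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][d] for w, j in zip(weights, allowed)) / total
            for d in range(out_dim)
        ])
    return outputs


if __name__ == "__main__":
    # Chunk of N = 8 frames: first half cached from t+2, second half from t+1.
    source_ids = [2] * 4 + [1] * 4
    feats = [[float(i), 1.0] for i in range(8)]
    values = [[1.0]] * 4 + [[-1.0]] * 4
    print(masked_temporal_attention(feats, feats, values, source_ids))
```

In this toy setup the two halves never mix: the first four outputs stay at the first half's value and the last four at the second half's, which is the kind of feature mismatch the mask is meant to keep out of the attention average.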