ViViD: Video Virtual Try-on using Diffusion Models

Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, Zheng-Jun Zha

TL;DR

ViViD is a diffusion-based framework for video virtual try-on that addresses temporal inconsistency and loss of garment detail through four components: a Garment Encoder, a Pose Encoder, attention feature fusion, and Temporal Modules. It introduces the large, high-resolution ViViD dataset, which covers diverse garment categories, and an image–video joint training strategy for learning both fine garment details and temporal dynamics. In qualitative and quantitative evaluations, ViViD outperforms prior image- and video-based methods in visual quality and stability, and its ablations confirm the importance of garment-aware encoding and joint training. The work targets practical video try-on for e-commerce and design, and the authors state that the dataset, code, and weights will be made publicly available for further research.

Abstract

Video virtual try-on aims to transfer a clothing item onto the video of a target person. Applying image-based try-on techniques to video frame by frame produces temporally inconsistent results, while previous video-based try-on solutions yield only low-quality, blurry outputs. In this work, we present ViViD, a novel framework that employs powerful diffusion models to tackle video virtual try-on. Specifically, we design the Garment Encoder to extract fine-grained clothing semantic features, guiding the model to capture garment details and inject them into the target video through the proposed attention feature fusion mechanism. To ensure spatial-temporal consistency, we introduce a lightweight Pose Encoder to encode pose signals, enabling the model to learn the interactions between clothing and human posture, and we insert hierarchical Temporal Modules into the text-to-image Stable Diffusion model for more coherent and lifelike video synthesis. Furthermore, we collect a new dataset which is, to date, the largest for video virtual try-on, with the most diverse garment types and the highest resolution. Extensive experiments demonstrate that our approach yields satisfactory video try-on results. The dataset, code, and weights will be publicly available. Project page: https://becauseimbatman0.github.io/ViViD.
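
The abstract names an attention feature fusion mechanism without spelling it out here. The sketch below shows one plausible reading, offered as an assumption rather than as the authors' implementation: garment tokens are appended to the UNet tokens, plain self-attention runs over the joint sequence, and only the UNet positions are kept. The function names (`self_attention`, `fuse`), the identity query/key/value projections, and the toy two-dimensional tokens are all hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with identity Q/K/V projections
    (an assumption made to keep the sketch short)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[d] for w, value in zip(weights, tokens))
             for d in range(dim)]
        )
    return outputs


def fuse(unet_tokens: List[Vector], garment_tokens: List[Vector]) -> List[Vector]:
    """Attend over UNet and garment tokens jointly, then keep only the
    positions that belong to the UNet (assumed fusion scheme)."""
    joint = unet_tokens + garment_tokens
    attended = self_attention(joint)
    return attended[: len(unet_tokens)]


# Toy example: two UNet tokens and one garment token, each of dimension 2.
unet_tokens = [[1.0, 0.0], [0.0, 1.0]]
garment_tokens = [[1.0, 1.0]]
fused = fuse(unet_tokens, garment_tokens)
```

In this reading the fused output has the same length as the UNet token sequence, so it could replace the UNet's own self-attention output without changing any downstream shapes, while each output token still mixes in garment information.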

Paper Structure

This paper contains 19 sections, 2 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Outfitted videos ($512 \times 384$) generated by our ViViD. The 1st and 4th rows are source videos. Please zoom in for more details.
  • Figure 2: An image-video pair from the ViViD dataset.
  • Figure 3: The overview of ViViD. First, the noisy video is concatenated with the clothing-agnostic video and the mask video, and the pose feature is then added to the result, which serves as the input to the UNet. Simultaneously, the Garment Encoder takes the clothing and the mask as input. Attention feature fusion is then conducted between the Garment Encoder and the UNet.
  • Figure 4: ViViD can handle a variety of clothing.
  • Figure 5: Qualitative comparison results of our ViViD with other visual try-on solutions on the VVT dataset.
  • ...and 6 more figures
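
The input assembly described in the Figure 3 caption can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released code: the helper names, the channel counts (4 + 4 + 1), the tiny 3 × 3 spatial size, and the assumption that the pose feature already matches the shape of the concatenated tensor are all hypothetical.

```python
from typing import List

Plane = List[List[float]]   # H x W
Frame = List[Plane]         # C x H x W
Video = List[Frame]         # T x C x H x W


def shape(video: Video) -> tuple:
    """Return (frames, channels, height, width) of a nested-list video."""
    return (len(video), len(video[0]), len(video[0][0]), len(video[0][0][0]))


def constant_video(frames: int, channels: int, height: int, width: int,
                   value: float) -> Video:
    """Build a video filled with one value (stand-in for real data)."""
    return [[[[value] * width for _ in range(height)]
             for _ in range(channels)]
            for _ in range(frames)]


def concat_channels(*videos: Video) -> Video:
    """Concatenate videos frame by frame along the channel axis."""
    n_frames = len(videos[0])
    if any(len(v) != n_frames for v in videos):
        raise ValueError("all videos must have the same number of frames")
    return [[plane for v in videos for plane in v[t]] for t in range(n_frames)]


def add_elementwise(a: Video, b: Video) -> Video:
    """Add two videos of identical shape element by element."""
    if shape(a) != shape(b):
        raise ValueError("shapes must match for element-wise addition")
    return [[[[x + y for x, y in zip(row_a, row_b)]
              for row_a, row_b in zip(plane_a, plane_b)]
             for plane_a, plane_b in zip(frame_a, frame_b)]
            for frame_a, frame_b in zip(a, b)]


# Hypothetical sizes: 2 frames, 3 x 3 resolution, 4 + 4 + 1 channels.
noisy = constant_video(2, 4, 3, 3, 0.5)         # noisy video
agnostic = constant_video(2, 4, 3, 3, 1.0)      # clothing-agnostic video
mask = constant_video(2, 1, 3, 3, 0.0)          # mask video
pose_feature = constant_video(2, 9, 3, 3, 0.25)  # assumed Pose Encoder output

# Concatenate along channels, then add the pose feature, as in the caption.
unet_input = add_elementwise(concat_channels(noisy, agnostic, mask), pose_feature)
```

Here `unet_input` has 9 channels per frame because the three inputs are stacked along the channel axis before the pose feature is added; in a real pipeline these would be tensors rather than nested lists.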