ViViD: Video Virtual Try-on using Diffusion Models
Zixun Fang, Wei Zhai, Aimin Su, Hongliang Song, Kai Zhu, Mao Wang, Yu Chen, Zhiheng Liu, Yang Cao, Zheng-Jun Zha
TL;DR
ViViD is a diffusion-based framework for video virtual try-on that addresses temporal inconsistency and loss of garment detail through a Garment Encoder, a Pose Encoder, an attention feature fusion mechanism, and Temporal Modules. It introduces the large, high-resolution ViViD dataset, which covers diverse garment categories, and an image–video joint training strategy for learning both fine garment details and temporal dynamics. In qualitative and quantitative evaluations, ViViD outperforms prior image- and video-based methods in visual quality and temporal stability, and its ablations confirm the importance of garment-aware encoding and joint training. The work advances practical video try-on for e-commerce and design, and the authors state that the dataset, code, and weights will be made publicly available for further research.
Abstract
Video virtual try-on aims to transfer a clothing item onto the video of a target person. Applying image-based try-on techniques to video frame by frame produces temporally inconsistent results, while previous video-based try-on solutions generate only blurry, low-quality results. In this work, we present ViViD, a novel framework that employs powerful diffusion models to tackle the task of video virtual try-on. Specifically, we design the Garment Encoder to extract fine-grained clothing semantic features, guiding the model to capture garment details and inject them into the target video through the proposed attention feature fusion mechanism. To ensure spatial-temporal consistency, we introduce a lightweight Pose Encoder to encode pose signals, enabling the model to learn the interactions between clothing and human posture, and we insert hierarchical Temporal Modules into the text-to-image Stable Diffusion model for more coherent and lifelike video synthesis. Furthermore, we collect a new dataset which is, to date, the largest for the task of video virtual try-on, with the most diverse types of garments and the highest resolution. Extensive experiments demonstrate that our approach yields satisfactory video try-on results. The dataset, code, and weights will be publicly available. Project page: https://becauseimbatman0.github.io/ViViD.
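
The abstract does not spell out how the attention feature fusion mechanism works. As a rough illustration only, one plausible reading is that garment tokens are concatenated with the person tokens, self-attention runs over the joint sequence, and only the person positions are kept. The sketch below follows that assumption in plain Python. The function names (`self_attention`, `fuse_garment_features`), the identity Q/K/V projections, and the toy token values are all hypothetical and are not taken from the ViViD code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity projections (Q = K = V = tokens).

    A real model would use learned projection matrices; they are omitted
    here to keep the sketch small.
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(dim)])
    return out


def fuse_garment_features(person_tokens, garment_tokens):
    """Assumed fusion: attend over [person; garment], keep the person positions."""
    joint = person_tokens + garment_tokens
    attended = self_attention(joint)
    return attended[: len(person_tokens)]


# Toy example: two person tokens and one garment token, each of dimension 2.
person = [[1.0, 0.0], [0.0, 1.0]]
garment = [[0.5, 0.5]]
fused = fuse_garment_features(person, garment)
```

Because every person token attends to the garment token, the fused output has the same shape as the person features but mixes in garment information, which is the general idea behind injecting garment details into the denoising features.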
