RealVVT: Towards Photorealistic Video Virtual Try-on via Spatio-Temporal Consistency
Siqi Li, Zhengkai Jiang, Jiawei Zhou, Zhihong Liu, Xiaowei Chi, Haoqian Wang
TL;DR
RealVVT addresses the persistent challenge of photorealistic, temporally coherent video virtual try-on by building on Stable Video Diffusion with a dual U-Net architecture. It introduces three key contributions: an Agnostic Mask-Guided Attention Loss to ensure accurate intra-frame garment fitting, Clothing & Temporal Consistency Attention to stabilize garment appearance across frames, and Pose-guided Long VVT to extend generation to long sequences via keyframe-based interpolation guided by DensePose. The approach delivers state-of-the-art performance on both image- and video-based VTO benchmarks (e.g., VVT, ViViD, VITON-HD, DressCode), with robust preservation of garment shape, texture, and color under diverse poses and viewpoints. This work advances practical VTO for fashion e-commerce and virtual fitting by achieving high fidelity and stable garment dynamics in video. It also discusses dataset limitations and future directions for high-quality video VTO data.
Abstract
Virtual try-on has emerged as a pivotal task at the intersection of computer vision and fashion, aimed at digitally simulating how clothing items fit on the human body. Despite notable progress in single-image virtual try-on (VTO), current methodologies often struggle to preserve a consistent and authentic appearance of clothing across extended video sequences. This challenge arises from the complexities of capturing dynamic human pose and maintaining target clothing characteristics. We leverage pre-existing video foundation models to introduce RealVVT, a photoRealistic Video Virtual Try-on framework tailored to bolster stability and realism within dynamic video contexts. Our methodology encompasses a Clothing & Temporal Consistency strategy, an Agnostic-guided Attention Focus Loss mechanism to ensure spatial consistency, and a Pose-guided Long Video VTO technique adept at handling extended video sequences. Extensive experiments across various datasets confirm that our approach outperforms existing state-of-the-art models in both single-image and video VTO tasks, offering a viable solution for practical applications in fashion e-commerce and virtual fitting environments.
