RealVVT: Towards Photorealistic Video Virtual Try-on via Spatio-Temporal Consistency

Siqi Li; Zhengkai Jiang; Jiawei Zhou; Zhihong Liu; Xiaowei Chi; Haoqian Wang

TL;DR

RealVVT addresses the persistent challenge of photorealistic, temporally coherent video virtual try-on by building on Stable Video Diffusion with a dual U-Net architecture. It makes three key contributions: an Agnostic Mask-Guided Attention Loss that enforces accurate intra-frame garment fitting, Clothing & Temporal Consistency Attention that stabilizes garment appearance across frames, and Pose-guided Long VVT, which extends generation to long sequences via keyframe-based interpolation guided by DensePose. The approach reports state-of-the-art performance on both image- and video-based VTO benchmarks (e.g., VVT, ViViD, VITON-HD, DressCode), with robust preservation of garment shape, texture, and color under diverse poses and viewpoints. By achieving high fidelity and stable garment dynamics in video, the work advances practical VTO for fashion e-commerce and virtual fitting; it also discusses dataset limitations and future directions for high-quality video VTO data.
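The keyframe-based idea behind Pose-guided Long VVT can be illustrated with a small scheduling sketch. This is not the paper's algorithm: the function name `plan_long_video`, the fixed `stride`, and the rule of always including the last frame as a keyframe are assumptions made purely for illustration. It only shows how a long sequence might be split into keyframes (generated first) and in-between frames (filled in afterwards, in the paper under DensePose guidance).

```python
def plan_long_video(num_frames, stride):
    """Split frame indices into keyframes and in-between segments.

    Hypothetical helper: a fixed stride is an assumption, not the
    paper's keyframe selection rule.
    """
    if num_frames < 1 or stride < 1:
        raise ValueError("num_frames and stride must be positive")

    # Every stride-th frame is a keyframe, starting from frame 0.
    keyframes = list(range(0, num_frames, stride))
    # Make sure the final frame is covered by a keyframe.
    if keyframes[-1] != num_frames - 1:
        keyframes.append(num_frames - 1)

    # Each segment is (start keyframe, end keyframe, frames to fill in between).
    segments = []
    for start, end in zip(keyframes, keyframes[1:]):
        segments.append((start, end, list(range(start + 1, end))))
    return keyframes, segments


keyframes, segments = plan_long_video(num_frames=10, stride=4)
print(keyframes)
for start, end, between in segments:
    print(start, end, between)
```

Under this scheme every frame is either a keyframe or lies strictly between two consecutive keyframes, so the in-between frames can be generated segment by segment with both endpoints already fixed.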

Abstract

Virtual try-on has emerged as a pivotal task at the intersection of computer vision and fashion, aimed at digitally simulating how clothing items fit on the human body. Despite notable progress in single-image virtual try-on (VTO), current methodologies often struggle to preserve a consistent and authentic appearance of clothing across extended video sequences. This challenge arises from the complexities of capturing dynamic human pose and maintaining target clothing characteristics. We leverage pre-existing video foundation models to introduce RealVVT, a photoRealistic Video Virtual Try-on framework tailored to bolster stability and realism within dynamic video contexts. Our methodology encompasses a Clothing & Temporal Consistency strategy, an Agnostic-guided Attention Focus Loss mechanism to ensure spatial consistency, and a Pose-guided Long Video VTO technique adept at handling extended video sequences. Extensive experiments across various datasets confirm that our approach outperforms existing state-of-the-art models in both single-image and video VTO tasks, offering a viable solution for practical applications within the realms of fashion e-commerce and virtual fitting environments.
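To make the idea of an attention focus loss concrete, here is a minimal sketch of one plausible form. The summary above does not give the paper's formula, so everything here is an assumption: the function name `attention_focus_loss`, the use of a single 2D attention map, and the choice to penalize the fraction of attention mass that falls outside the agnostic (garment-region) mask.

```python
def attention_focus_loss(attention, mask):
    """Fraction of attention mass outside the masked region.

    attention: 2D list of non-negative weights (one spatial map).
    mask:      2D list of 0/1 values, 1 inside the agnostic mask.
    Illustrative form only, not the paper's loss.
    """
    total = sum(sum(row) for row in attention)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")

    # Attention mass that lands inside the masked region.
    inside = sum(
        a * m
        for row_a, row_m in zip(attention, mask)
        for a, m in zip(row_a, row_m)
    )
    # 0.0 when all attention is inside the mask, 1.0 when none is.
    return 1.0 - inside / total


attention = [[0.1, 0.4],
             [0.4, 0.1]]
mask = [[0, 1],
        [1, 0]]
print(attention_focus_loss(attention, mask))
```

A loss of this shape is lowest when garment attention concentrates where the garment should appear, which matches the stated goal of spatial consistency within a frame.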

Paper Structure

This paper contains 16 sections, 9 equations, 7 figures, 6 tables, 1 algorithm.

Introduction
Related Work
Video Virtual Try-On
Video Generation via Diffusion Models
Proposed Approach
Preliminary
Overview
Agnostic Mask-Guided Attention for Clothing Consistency
Clothing & Temporal Consistency Attention
Pose-guided Long VVT
Experiments
Comparison with State-of-the-Art Methods
Ablation Study
Conclusion
More Image Dataset Visual Results
...and 1 more section

Figures (7)

Figure 1: RealVVT is a novel framework that takes as input a video of a human performing arbitrary motions from any viewpoint, along with a garment (e.g., upper body, lower body, or dress) to be virtually worn. The system seamlessly integrates the garment into the person's “OOTD” (Outfit Of The Day) and evaluates its aesthetic compatibility and fit through dynamic video results. This figure showcases a subset of generated results, demonstrating RealVVT's ability to maintain the characteristics and details of the target garment while ensuring consistency with the subject's motion.
Figure 2: An overview of RealVVT. A Reference Net and CLIP Encoder extract target garment features, while the input video is processed by Denoising UNet. The figure omits the VAE encoder and decoder for clarity. The right side illustrates the mechanisms of our proposed Clothing & Temporal Consistency Attention and Pose-guided Long VVT components.
Figure 3: Virtual try-on comparison with state-of-the-art methods. (b) StableVITON, an image-based VTO method, exhibits significant flickering in continuous video generation. (c) ViViD, a video-based VTO method, suffers from unstable clothing appearance, particularly noticeable around the neckline, as well as visible artifacts. (d) Our method ensures consistent clothing appearance with realistic texture preservation and minimal artifacts.
Figure 4: Virtual try-on results for a challenging case: fitting a small garment onto a large agnostic mask video. Comparisons are shown between ViViD and RealVVT, both trained at 512×384 resolution and tested at 832×624 resolution.
Figure 5: Effect of Agnostic Mask-Guided loss and Clothing & Temporal Consistency Attention. The first and third images in (a) are input frames, while the second image is not used as input and instead serves to illustrate the original video.
...and 2 more figures
