
Vanast: Virtual Try-On with Human Image Animation via Synthetic Triplet Supervision

Hyunsoo Cha, Wonjung Woo, Byungjun Kim, Hanbyul Joo

Abstract

We present Vanast, a unified framework that generates garment-transferred human animation videos directly from a single human image, garment images, and a pose-guidance video. Conventional two-stage pipelines treat image-based virtual try-on and pose-driven animation as separate processes, which often results in identity drift, garment distortion, and front-back inconsistency. Our model addresses these issues by performing the entire process in a single unified pass, achieving coherent synthesis. To enable this setting, we construct large-scale triplet supervision. Our data generation pipeline (i) generates identity-preserving human images in alternative outfits that differ from the garment catalog images, (ii) captures full upper- and lower-garment triplets to overcome the limitation of existing data that pairs only a single garment with a posed video, and (iii) assembles diverse in-the-wild triplets without requiring garment catalog images. We further introduce a Dual Module architecture for video diffusion transformers that stabilizes training, preserves pretrained generative quality, and improves garment accuracy, pose adherence, and identity preservation while supporting zero-shot garment interpolation. Together, these contributions allow Vanast to produce high-fidelity, identity-consistent animation across a wide range of garment types.
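
To make the setting concrete, the sketch below illustrates one plausible shape for the triplet supervision and the single-pass interface the abstract describes. Every name in it (`Triplet`, `generate`, the tensor layouts) is our own illustrative assumption, not the authors' released code.

```python
# A minimal sketch of triplet supervision and the unified single-pass
# interface described in the abstract. All names (Triplet, generate,
# tensor layouts) are hypothetical illustrations, not the authors' API.
from dataclasses import dataclass

import torch


@dataclass
class Triplet:
    """One supervision sample: three conditioning inputs plus the target."""
    human_image: torch.Tensor     # (3, H, W) identity-preserving person image
    garment_images: torch.Tensor  # (N, 3, H, W) upper/lower garment images
    pose_video: torch.Tensor      # (T, 3, H, W) pose-guidance frames
    target_video: torch.Tensor    # (T, 3, H, W) ground-truth try-on animation


def generate(model, sample: Triplet) -> torch.Tensor:
    """Unified single-pass generation: all three conditions enter one
    model call, rather than a try-on stage chained into an animation stage."""
    return model(
        human=sample.human_image,
        garments=sample.garment_images,
        pose=sample.pose_video,
    )
```

The point of the sketch is the interface, not the internals: the two-stage baselines the abstract criticizes would instead call a try-on model first and feed its output image into a separate animation model, which is where identity drift and garment distortion can accumulate.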

Figures (10)

  • Figure 1: Vanast. Given a human image and one or more garment images, our method generates virtual try-on with human image animation conditioned on a pose video while preserving identity.
  • Figure 2: Overview of the Vanast Pipeline. Our Vanast framework generates virtual try-on human animation videos from a human image, garment images, and a pose video. By incorporating scalable human-image and garment-image generation pipelines, our method avoids dataset-specific constraints and trains effectively at scale. The Dual Module architecture ensures that the three conditioning signals (human image $\mathbf{I}^{\mathbf{G'}}$, garment images $\mathbf{G}$, and pose video $\mathbf{K}$) are faithfully reflected in the resulting video; a notational sketch follows this figure list.
  • Figure 3: Samples of Synthetic Triplet Datasets. We show samples of the datasets used for generation and training. The triplet construction contributes to enabling the model to preserve identity while accurately transferring garments and producing animation videos that follow the target pose.
  • Figure 4: Qualitative Comparisons (Subject-to-Image-based). We compare our results with baselines constructed by combining subject-to-image generation and animation models. Our method produces the most accurate pose following and garment transfer while preserving identity with high fidelity.
  • Figure 5: Qualitative Comparisons (Virtual Try-On-based). We compare our results with baselines formed by combining image virtual try-on models with animation models. Our method achieves the most accurate pose following and garment transfer while preserving identity with the highest fidelity.
  • ...and 5 more figures
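
Reading the Figure 2 caption notationally (our paraphrase of the caption, not an equation taken from the paper), the unified generation step can be summarized as

$$\mathbf{V} = f_{\theta}\left(\mathbf{I}^{\mathbf{G'}},\ \mathbf{G},\ \mathbf{K}\right),$$

where $\mathbf{I}^{\mathbf{G'}}$, $\mathbf{G}$, and $\mathbf{K}$ are the caption's human image, garment images, and pose video, while $f_{\theta}$ (the Dual Module video diffusion transformer) and $\mathbf{V}$ (the generated try-on animation) are our own notation.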