CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation

Zheng Chong; Wenqing Zhang; Shiyue Zhang; Jun Zheng; Xiao Dong; Haoxiang Li; Yiling Wu; Dongmei Jiang; Xiaodan Liang

TL;DR

This work introduces CatV2TON, a unified diffusion-transformer model for vision-based virtual try-on that handles both images and videos by temporally concatenating garment and person inputs. It achieves high-quality static and dynamic try-ons while keeping trainable backbone parameters below $20\%$, with no extra modules. To enable long-video generation, it proposes an overlapping clip-based inference strategy and Adaptive Clip Normalization (AdaCN) to preserve temporal coherence with reduced resources. A new ViViD-S dataset is created by filtering back-facing frames and applying 3D mask smoothing to improve temporal consistency. Across image and video benchmarks, CatV2TON outperforms strong baselines, demonstrating the potential of a unified, efficient approach to realistic virtual try-ons in diverse scenarios.

Abstract

Virtual try-on (VTON) technology has gained attention due to its potential to transform online retail by enabling realistic clothing visualization in images and videos. However, most existing methods struggle to achieve high-quality results across image and video try-on tasks, especially in long video scenarios. In this work, we introduce CatV2TON, a simple and effective vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks with a single diffusion transformer model. By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance across static and dynamic settings. For efficient long-video generation, we propose an overlapping clip-based inference strategy that uses sequential frame guidance and Adaptive Clip Normalization (AdaCN) to maintain temporal consistency with reduced resource demands. We also present ViViD-S, a refined video try-on dataset built by filtering back-facing frames and applying 3D mask smoothing for enhanced temporal consistency. Comprehensive experiments demonstrate that CatV2TON outperforms existing methods in both image and video try-on tasks, offering a versatile and reliable solution for realistic virtual try-ons across diverse scenarios.
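
The temporal concatenation idea can be illustrated with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: latents are modeled as plain nested lists of shape (frames, tokens), the person-first ordering is assumed, and the names `temporal_concat` and `split_person_frames` are hypothetical. An image is treated as a one-frame video, which is why one sequence layout can serve both tasks.

```python
def temporal_concat(person_frames, garment_frames):
    """Join person and garment latents along the time (frame) axis.

    Each argument is a list of frames; each frame is a list of token values.
    Assumption: both inputs share the same per-frame token count, and the
    person frames come first (the ordering is illustrative only).
    """
    tokens_per_frame = len(person_frames[0])
    for frame in person_frames + garment_frames:
        if len(frame) != tokens_per_frame:
            raise ValueError("all frames must have the same token count")
    return person_frames + garment_frames


def split_person_frames(sequence, num_person_frames):
    """Keep only the person frames from a concatenated sequence."""
    return sequence[:num_person_frames]


# Video case: 4 person frames, 1 garment frame, 3 tokens per frame.
video_person = [[float(t)] * 3 for t in range(4)]
garment = [[9.0] * 3]
video_sequence = temporal_concat(video_person, garment)

# Image case: the person image is a single-frame "video".
image_person = [[0.5] * 3]
image_sequence = temporal_concat(image_person, garment)
```

In this toy layout the video sequence has five frames and the image sequence has two; slicing off the trailing garment frame recovers the person frames in either case.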

Paper Structure

This paper contains 19 sections, 1 equation, 8 figures, and 5 tables.

Introduction
Related Work
Video Synthesis and Generation
Vision-based Virtual Try-On
Method
Vision-based Try-On Diffusion Transformer
Input Conditions
Network Structure
Training Strategy
Overlapping Clip-Based Inference
Experiments
Datasets
Implementation Details
Metrics
Qualitative Comparison
...and 4 more sections

Figures (8)

Figure 1: Examples of CatV$^2$TON's unified virtual try-on capabilities, demonstrating high-quality garment consistency across both image-based and video-based try-on tasks, including dynamic long-video scenarios.
Figure 2: Overview of the CatV$^2$TON architecture. CatV$^2$TON uses DiT [Peebles2022DiT] as the backbone, with the first DiT block duplicated as the Pose Encoder. The person and garment conditions are concatenated temporally as try-on conditions. The trainable portion consists only of the self-attention layers and the Pose Encoder, accounting for less than 1/5 of the total parameters.
Figure 3: Illustration of the Overlapping Clip-Based Inference strategy. (a) A long video is divided into $n$ overlapping clips, with each clip consisting of repeated frames. The last $k$ frames of each clip are used as prompt frames for generating the next clip. (b) Adaptive Clip Normalization (AdaCN) is applied to normalize the entire clip based on the mean and standard deviation of the prompt frame features and the denoised prompt frames, ensuring smooth continuity across clips in the generated video.
Figure 4: Qualitative comparison on the ViViD [fang2024vivid] dataset for dresses. "Stable" and "OOTD" abbreviate StableVITON [kim2023stableviton] and OOTDiffusion [xu2024ootdiffusion]. Additional comparison results are provided in the supplementary materials. Please zoom in for more details.
Figure 5: Qualitative comparison on the ViViD [fang2024vivid] dataset for lower-body garments. "Stable" and "OOTD" abbreviate StableVITON [kim2023stableviton] and OOTDiffusion [xu2024ootdiffusion]. Additional comparison results are provided in the supplementary materials. Please zoom in for more details.
...and 3 more figures
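
The overlapping clip schedule and the normalization step described in the Figure 3 caption can be sketched as follows. This is one plausible reading of the caption, not the authors' code: the function names `overlapping_clips` and `adacn`, the `eps` parameter, the use of global (per-clip) statistics, and the convention that the first `num_prompt` frames of a clip are its denoised prompt frames are all assumptions made for illustration. Frames are modeled as flat lists of floats.

```python
from statistics import fmean, pstdev


def overlapping_clips(num_frames, clip_len, overlap):
    """Return (start, end) frame ranges; adjacent clips share `overlap` frames."""
    if not 0 <= overlap < clip_len:
        raise ValueError("overlap must satisfy 0 <= overlap < clip_len")
    step = clip_len - overlap
    clips, start = [], 0
    while True:
        end = min(start + clip_len, num_frames)
        clips.append((start, end))
        if end == num_frames:
            return clips
        start += step


def adacn(clip, prompt_ref, num_prompt, eps=1e-6):
    """Shift and scale a whole clip so that its denoised prompt frames match
    the statistics of the reference prompt frames from the previous clip.

    clip:       frames of the current clip; the first `num_prompt` frames are
                assumed to be the denoised prompt frames.
    prompt_ref: the corresponding frames taken from the previous clip.
    """
    ref_vals = [v for frame in prompt_ref for v in frame]
    cur_vals = [v for frame in clip[:num_prompt] for v in frame]
    ref_mean, ref_std = fmean(ref_vals), pstdev(ref_vals)
    cur_mean, cur_std = fmean(cur_vals), pstdev(cur_vals)
    scale = ref_std / (cur_std + eps)  # eps guards against a zero std
    return [[(v - cur_mean) * scale + ref_mean for v in frame] for frame in clip]


# A 10-frame video, clips of 4 frames, 1 shared prompt frame between clips.
schedule = overlapping_clips(10, 4, 1)

# Toy clip whose prompt frame drifted away from the reference statistics.
reference = [[0.0, 2.0]]
drifted_clip = [[10.0, 14.0], [12.0, 16.0]]
aligned_clip = adacn(drifted_clip, reference, num_prompt=1)
```

With these toy values the schedule covers frames 0 to 4, 3 to 7, and 6 to 10, and the aligned clip's prompt frame returns to approximately `[0.0, 2.0]` while the remaining frame is shifted and scaled by the same amounts.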
