CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models

Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, Wenqing Zhang, Xujie Zhang, Hanqing Zhao, Dongmei Jiang, Xiaodan Liang

TL;DR

CatVTON is a lean diffusion-based virtual try-on framework that removes the image/text encoders and ReferenceNet, and instead spatially concatenates the garment and person inputs within a VAE–UNet backbone. It achieves high-quality results while training only the self-attention parameters (≈49.57M) out of 899.06M total parameters, and it runs inference without pose estimation, human parsing, or captioning. Extensive experiments on VITON-HD, DressCode, and DeepFashion show superior qualitative and quantitative performance and strong generalization in the wild, while reducing memory usage by over 49% compared to other diffusion-based methods. The work highlights practical implications for deploying diffusion-based VTON in real-world applications and points to future efficiency-focused directions.

Abstract

Virtual try-on methods based on diffusion models achieve realistic effects but often require additional encoding modules, a large number of training parameters, and complex preprocessing, which increase the burden on training and inference. In this work, we re-evaluate the necessity of additional modules and analyze how to improve training efficiency and reduce redundant steps in the inference process. Based on these insights, we propose CatVTON, a simple and efficient virtual try-on diffusion model that transfers in-shop or worn garments of arbitrary categories to target individuals by concatenating them along spatial dimensions as inputs of the diffusion model. The efficiency of CatVTON is reflected in three aspects: (1) Lightweight network. CatVTON consists only of a VAE and a simplified denoising UNet, removing redundant image and text encoders as well as cross-attentions, and includes just 899.06M parameters. (2) Parameter-efficient training. Through experimental analysis, we identify self-attention modules as crucial for adapting pre-trained diffusion models to the virtual try-on task, enabling high-quality results with only 49.57M training parameters. (3) Simplified inference. CatVTON eliminates unnecessary preprocessing, such as pose estimation, human parsing, and captioning, requiring only a person image and garment reference to guide the virtual try-on process, reducing memory usage by over 49% compared to other diffusion-based methods. Extensive experiments demonstrate that CatVTON achieves superior qualitative and quantitative results compared to baseline methods and generalizes well to in-the-wild scenarios, despite being trained solely on public datasets with 73K samples.
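
The spatial concatenation described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the helper names (`make_latent`, `concat_width`, `crop_person`), the toy latent shape, the choice of the width axis, and the final cropping step are all assumptions made for illustration. The point is only that the person and garment latents keep their channel count and are placed side by side, so a single denoising network sees both in the same feature space.

```python
# Toy sketch of spatial concatenation (assumed details, not the paper's code).
# Latents are nested lists shaped [channels][height][width].

def make_latent(channels, height, width, value):
    """Build a constant-valued toy latent of shape C x H x W."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]

def shape(t):
    """Return (C, H, W) of a nested-list latent."""
    return (len(t), len(t[0]), len(t[0][0]))

def concat_width(person, garment):
    """Place the garment latent next to the person latent along the width axis."""
    if shape(person)[:2] != shape(garment)[:2]:
        raise ValueError("channel count and height must match")
    return [
        [p_row + g_row for p_row, g_row in zip(p_ch, g_ch)]
        for p_ch, g_ch in zip(person, garment)
    ]

def crop_person(joint, person_width):
    """Keep only the person half of a joint latent (assumed post-denoising step)."""
    return [[row[:person_width] for row in ch] for ch in joint]

person = make_latent(4, 8, 6, 0.0)   # hypothetical person latent
garment = make_latent(4, 8, 6, 1.0)  # hypothetical garment latent

joint = concat_width(person, garment)
print(shape(joint))                  # (4, 8, 12): width doubles, channels unchanged
recovered = crop_person(joint, 6)
```

In a tensor library the same operation would be a single concatenation call along one spatial axis; the nested-list version is used here only to keep the example self-contained.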

Paper Structure (28 sections, 8 equations, 8 figures, 6 tables)

This paper contains 28 sections, 8 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: (a) Structure comparison of different try-on methods. CatVTON eliminates the need for garment warping or an additional ReferenceNet, resulting in a simple structure. (b) Efficiency comparison with diffusion-based try-on methods. Each method is represented by two concentric circles, where the outer circle denotes the total parameters and the inner circle indicates the trainable parameters. CatVTON achieves lower FID on the VITON-HD dataset with fewer total parameters, fewer trainable parameters, and lower memory usage.
  • Figure 2: Overview of CatVTON. Our method achieves high-quality try-ons by simply concatenating the conditional image (garment or reference person) with the target person image in the spatial dimension, ensuring they remain in the same feature space during denoising. Only self-attention parameters, which provide global interaction, are learnable, while cross-attention for text interaction is omitted. No additional conditions (pose, parsing) are needed, resulting in a lightweight network with minimal trainable parameters and simplified inference.
  • Figure 3: Overview of the mask-free training pipeline. We first use the trained mask-based model to generate synthetic person images from randomly sampled person-garment pairs. These synthetic person images, along with their corresponding original person and garment images, form the training data for the mask-free model.
  • Figure 4: Qualitative comparison on the VITON-HD and DressCode datasets. CatVTON demonstrates a distinct advantage in handling complex patterns and text. Please zoom in for more details.
  • Figure 5: Qualitative results and comparisons in in-the-wild scenarios. OutfitAnyone (Sun et al., 2024) only supports inference on its provided person images. Our method combines background, person, and garment more naturally in complex scenarios. Please zoom in for more details.
  • ...and 3 more figures
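
The Figure 2 caption states that only the self-attention parameters are learnable. A minimal sketch of that selection rule follows, using a toy parameter table rather than a real model. The parameter names, the toy counts, and the convention that `attn1` marks a self-attention layer are assumptions for illustration; only the two reported totals (49.57M trainable, 899.06M overall) come from the text above.

```python
# Toy sketch of "train only self-attention" (assumed names and counts).

REPORTED_TOTAL_M = 899.06      # total parameters reported in the abstract (millions)
REPORTED_TRAINABLE_M = 49.57   # trainable parameters reported in the abstract (millions)

def is_self_attention(name):
    """Assumed naming convention: self-attention layers contain '.attn1.'."""
    return ".attn1." in name

def split_parameters(param_counts):
    """Split a {name: count} table into (trainable, frozen) tables."""
    trainable = {n: c for n, c in param_counts.items() if is_self_attention(n)}
    frozen = {n: c for n, c in param_counts.items() if not is_self_attention(n)}
    return trainable, frozen

# Hypothetical parameter table standing in for a denoising UNet.
toy_unet = {
    "down.0.resnet.conv1.weight": 1000,
    "down.0.attn1.to_q.weight": 200,
    "down.0.attn1.to_k.weight": 200,
    "mid.attn1.to_v.weight": 300,
    "up.0.resnet.conv2.weight": 1500,
}

trainable, frozen = split_parameters(toy_unet)
toy_fraction = sum(trainable.values()) / sum(toy_unet.values())

# Share of the reported totals that is trainable.
reported_fraction = REPORTED_TRAINABLE_M / REPORTED_TOTAL_M
print(f"{reported_fraction:.2%}")  # 5.51%
```

With the reported figures, roughly 5.5% of all parameters are updated during training; in a deep-learning framework the same rule would be applied by disabling gradients for every parameter outside the selected set.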