CrossVTON: Mimicking the Logic Reasoning on Cross-category Virtual Try-on guided by Tri-zone Priors

Donghao Luo, Yujie Liang, Xu Peng, Xiaobin Hu, Boyuan Jiang, Chengming Xu, Taisong Jin, Chengjie Wang, Yanwei Fu

TL;DR

CrossVTON addresses cross-category virtual try-on by introducing tri-zone priors that partition the model image into a try-on zone $Z^{tryon}$, a reconstruction zone $Z^{recon}$, and an imagination zone $Z^{imagi}$, so that the method can reason about how a garment fits the model across diverse categories. It uses a two-stage diffusion-based pipeline (a Tri-zone Net that produces the $M_3$ priors and a Try-on Net that synthesizes the result) and a progressive two-round data-construction scheme that covers intra-category, any-to-dress, and dress-to-any cases. The training objective is $L_g = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1), t \sim \mathcal{U}(t)} [ w(t) \| \epsilon_\theta(z_t; I_{vec}, t) - \epsilon \|^2 ]$. Empirically, CrossVTON achieves state-of-the-art results on cross-category VTON benchmarks, with strong realism, detail preservation, and category-agnostic reasoning that is relevant to e-commerce applications. Together, the tri-zone priors and the iterative data-construction paradigm yield robust performance on both intra-category and cross-category transitions and address real-world size-mismatch challenges.
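The objective $L_g$ above has the shape of a weighted noise-prediction loss. The following is a minimal sketch of how such an expectation could be estimated by sampling, not the authors' implementation. The forward noising rule with a cumulative schedule `alpha_bars`, the uniform sampling of `t` over discrete steps, and the names `add_noise`, `weighted_noise_loss`, `estimate_loss`, `oracle`, and `zero` are all assumptions introduced for illustration; `cond` stands in for the conditioning input $I_{vec}$.

```python
import math
import random

def add_noise(z0, eps, alpha_bar_t):
    """Assumed forward step: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]

def weighted_noise_loss(eps_pred, eps, w_t):
    """w(t) * ||eps_pred - eps||^2 for a single sample."""
    return w_t * sum((p - e) ** 2 for p, e in zip(eps_pred, eps))

def estimate_loss(eps_theta, z0, cond, alpha_bars, weight, n_samples=256, seed=0):
    """Monte Carlo estimate of the expectation over eps ~ N(0, 1) and t."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = rng.randrange(len(alpha_bars))       # assumption: t uniform over steps
        eps = [rng.gauss(0.0, 1.0) for _ in z0]  # eps ~ N(0, 1), one value per entry
        z_t = add_noise(z0, eps, alpha_bars[t])
        total += weighted_noise_loss(eps_theta(z_t, cond, t), eps, weight(t))
    return total / n_samples

# Toy setup: a 3-value "latent" and a 3-step schedule (hypothetical values).
alpha_bars = [0.9, 0.5, 0.1]
z0 = [0.3, -1.2, 0.7]

def oracle(z_t, cond, t):
    """Cheating predictor that knows z0 and inverts the forward step exactly."""
    a = math.sqrt(alpha_bars[t])
    b = math.sqrt(1.0 - alpha_bars[t])
    return [(zt - a * z) / b for zt, z in zip(z_t, z0)]

def zero(z_t, cond, t):
    """Trivial predictor that always outputs zero noise."""
    return [0.0 for _ in z_t]

loss_oracle = estimate_loss(oracle, z0, None, alpha_bars, lambda t: 1.0)
loss_zero = estimate_loss(zero, z0, None, alpha_bars, lambda t: 1.0)
```

A predictor that recovers the injected noise exactly drives the loss to zero up to rounding, while the zero predictor is penalized by the squared norm of the noise itself, which is the behaviour the objective is meant to reward.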

Abstract

Despite remarkable progress in image-based virtual try-on systems, generating realistic and robust fitting images for cross-category virtual try-on remains a challenging task. The primary difficulty arises from the absence of human-like reasoning, which involves addressing size mismatches between garments and models while recognizing and leveraging the distinct functionalities of various regions within the model images. To address this issue, we draw inspiration from human cognitive processes and disentangle the complex reasoning required for cross-category try-on into a structured framework. This framework systematically decomposes the model image into three distinct regions: try-on, reconstruction, and imagination zones. Each zone plays a specific role in accommodating the garment and facilitating realistic synthesis. To endow the model with robust reasoning capabilities for cross-category scenarios, we propose an iterative data constructor. This constructor encompasses diverse scenarios, including intra-category try-on, any-to-dress transformations (replacing any garment category with a dress), and dress-to-any transformations (replacing a dress with another garment category). Utilizing the generated dataset, we introduce a tri-zone priors generator that intelligently predicts the try-on, reconstruction, and imagination zones by analyzing how the input garment is expected to align with the model image. Guided by these tri-zone priors, our proposed method, CrossVTON, achieves state-of-the-art performance, surpassing existing baselines in both qualitative and quantitative evaluations. Notably, it demonstrates superior capability in handling cross-category virtual try-on, meeting the complex demands of real-world applications.
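To make the three-zone decomposition concrete, here is a small sketch of a tri-zone prior stored as a per-pixel label map. It is an illustration under stated assumptions, not the paper's method: the roles given to each zone in the comments are inferred from the zone names, `Zone`, `zone_masks`, and `compose` are hypothetical names, and the simple pixel compositing in `compose` stands in for what the diffusion-based Try-on Net would actually do with the priors.

```python
from enum import IntEnum

class Zone(IntEnum):
    TRYON = 0  # assumed role: region where the new garment is placed
    RECON = 1  # assumed role: region preserved from the model image
    IMAGI = 2  # assumed role: region seen in neither input, to be synthesized

def zone_masks(zone_map):
    """Split an H x W label map into one binary mask per zone."""
    return {z: [[int(v == z) for v in row] for row in zone_map] for z in Zone}

def compose(model_img, generated_img, zone_map):
    """Keep model pixels in RECON and take generated pixels everywhere else."""
    return [
        [m if z == Zone.RECON else g for m, g, z in zip(m_row, g_row, z_row)]
        for m_row, g_row, z_row in zip(model_img, generated_img, zone_map)
    ]

# Toy 2 x 2 example with symbolic pixel values.
zone_map = [[Zone.RECON, Zone.TRYON],
            [Zone.IMAGI, Zone.RECON]]
masks = zone_masks(zone_map)
result = compose([["m", "m"], ["m", "m"]], [["g", "g"], ["g", "g"]], zone_map)
```

Unlike a single binary mask, the label map assigns every pixel to exactly one of three zones, so the three masks always sum to one at each position.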

Paper Structure

This paper contains 15 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The tri-zone prior decomposes the content of the model image, assigning each part to the try-on, reconstruction, or imagination zone. Unlike the commonly used binary mask priors, tri-zone priors give CrossVTON the capability of cross-category virtual try-on.
  • Figure 2: An overview of the whole pipeline and the structure of CrossVTON, which consists of the Tri-zone Net and the Try-on Net. The pipeline illustrates two rounds of iterative cross-category data construction that synthesize the intra-category, any-to-dress, and dress-to-any data. In each round, CrossVTON is trained progressively to generate tri-zone priors and to acquire the ability of cross-category virtual try-on.
  • Figure 3: First-round intra-category and any-to-dress data construction
  • Figure 5: Visual results on CCDC (top) and CCGD (bottom). Best viewed when zoomed in.
  • Figure 6: Ablation study with and without the tri-zone prior
  • ...and 1 more figure