CrossVTON: Mimicking the Logic Reasoning on Cross-category Virtual Try-on guided by Tri-zone Priors

Donghao Luo; Yujie Liang; Xu Peng; Xiaobin Hu; Boyuan Jiang; Chengming Xu; Taisong Jin; Chengjie Wang; Yanwei Fu

TL;DR

CrossVTON addresses cross-category virtual try-on by introducing tri-zone priors that partition the model image into a try-on zone $Z^{tryon}$, a reconstruction zone $Z^{recon}$, and an imagination zone $Z^{imagi}$, so that the system can reason about how a garment fits the model across diverse categories. It uses a two-stage diffusion-based pipeline: Tri-zone Net produces the $M_3$ priors, and Try-on Net synthesizes the try-on image. A progressive two-round data-construction scheme covers the intra-category, any-to-dress, and dress-to-any cases. The training objective is a weighted noise-prediction loss, $L_g = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1), t \sim \mathcal{U}(t)} [ w(t) \| \epsilon_\theta(z_t; I_{vec}, t) - \epsilon \|^2 ]$. Empirically, CrossVTON achieves state-of-the-art results on cross-category VTON benchmarks, with strong realism, detail preservation, and category-agnostic reasoning, and it targets the garment–model size mismatches that arise in real-world e-commerce use.
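To make the objective $L_g$ concrete, here is a minimal pure-Python sketch of a Monte Carlo estimate of a weighted noise-prediction loss of that form. It is an illustration under stated assumptions, not the authors' implementation: the forward-noising rule in `add_noise` is the standard DDPM one, which the text does not spell out, and `linear_alpha_bars`, `estimate_loss`, the `cond` argument (standing in for $I_{vec}$), and the toy schedule are all hypothetical names chosen for this example.

```python
import math
import random

def linear_alpha_bars(num_steps):
    """Hypothetical schedule: cumulative signal level falls linearly to 0.1."""
    return [1.0 - 0.9 * (i + 1) / num_steps for i in range(num_steps)]

def add_noise(z0, eps, alpha_bar):
    """Standard DDPM forward step (an assumption, not stated in the text)."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(z0, eps)]

def estimate_loss(z0, eps_theta, cond, alpha_bars, weight, n_samples=256, seed=0):
    """Monte Carlo estimate of E[w(t) * ||eps_theta(z_t; cond, t) - eps||^2]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = rng.randrange(len(alpha_bars))        # t drawn uniformly over steps
        eps = [rng.gauss(0.0, 1.0) for _ in z0]   # eps ~ N(0, 1), one per latent entry
        z_t = add_noise(z0, eps, alpha_bars[t])   # noisy latent at step t
        pred = eps_theta(z_t, cond, t)            # model's noise prediction
        total += weight(t) * sum((p - e) ** 2 for p, e in zip(pred, eps))
    return total / n_samples
```

Here `eps_theta` is any callable taking `(z_t, cond, t)` and returning a list the same length as `z_t`. A predictor that recovers the injected noise exactly drives the estimate to zero (up to floating-point error), while one that always returns zeros yields a strictly positive loss.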

Abstract

Despite remarkable progress in image-based virtual try-on systems, generating realistic and robust fitting images for cross-category virtual try-on remains a challenging task. The primary difficulty arises from the absence of human-like reasoning, which involves addressing size mismatches between garments and models while recognizing and leveraging the distinct functionalities of various regions within the model images. To address this issue, we draw inspiration from human cognitive processes and disentangle the complex reasoning required for cross-category try-on into a structured framework. This framework systematically decomposes the model image into three distinct regions: try-on, reconstruction, and imagination zones. Each zone plays a specific role in accommodating the garment and facilitating realistic synthesis. To endow the model with robust reasoning capabilities for cross-category scenarios, we propose an iterative data constructor. This constructor encompasses diverse scenarios, including intra-category try-on, any-to-dress transformations (replacing any garment category with a dress), and dress-to-any transformations (replacing a dress with another garment category). Utilizing the generated dataset, we introduce a tri-zone priors generator that intelligently predicts the try-on, reconstruction, and imagination zones by analyzing how the input garment is expected to align with the model image. Guided by these tri-zone priors, our proposed method, CrossVTON, achieves state-of-the-art performance, surpassing existing baselines in both qualitative and quantitative evaluations. Notably, it demonstrates superior capability in handling cross-category virtual try-on, meeting the complex demands of real-world applications.
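The three-zone decomposition described above can be pictured as three binary masks over the model image. The sketch below is only an illustration: it assumes the zones are given as a per-pixel label map and that each pixel belongs to exactly one zone. The integer codes, the toy 4x4 layout, and the helper names `split_zones`, `is_partition`, and `zone_fractions` are hypothetical and do not come from the paper.

```python
TRYON, RECON, IMAGI = 0, 1, 2  # hypothetical integer codes for the three zones

def split_zones(label_map):
    """Turn an H x W grid of zone codes into one binary mask per zone."""
    return {
        zone: [[1 if v == zone else 0 for v in row] for row in label_map]
        for zone in (TRYON, RECON, IMAGI)
    }

def is_partition(masks):
    """True if every pixel is covered by exactly one of the three masks."""
    rows = zip(masks[TRYON], masks[RECON], masks[IMAGI])
    return all(a + b + c == 1 for ra, rb, rc in rows for a, b, c in zip(ra, rb, rc))

def zone_fractions(masks):
    """Share of the image area assigned to each zone."""
    area = sum(len(row) for row in masks[TRYON])
    return {zone: sum(map(sum, m)) / area for zone, m in masks.items()}

# Toy label map; the layout is made up purely for illustration.
toy = [
    [RECON, RECON, RECON, RECON],
    [RECON, TRYON, TRYON, RECON],
    [RECON, TRYON, TRYON, RECON],
    [RECON, IMAGI, IMAGI, RECON],
]
```

Calling `split_zones(toy)` gives three masks that `is_partition` accepts, and `zone_fractions` reports how much of the image each zone occupies. A real tri-zone prior would be predicted from the garment and model images rather than written by hand.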

Figures (6)