VTBench: Comprehensive Benchmark Suite Towards Real-World Virtual Try-on Models

Hu Xiaobin, Liang Yujie, Luo Donghao, Peng Xu, Zhang Jiangning, Zhu Junwei, Wang Chengjie, Fu Yanwei

TL;DR

VTBench introduces a hierarchical, multi-dimensional benchmark for rigorously evaluating image-based virtual try-on under real-world conditions, addressing the misalignment between perceptual quality and existing metrics. It defines six disentangled evaluation dimensions across three high-level categories, with dedicated test sets and novel unpaired metrics (including $E_{TSS}$ and $E_{size}$), and provides human preference annotations to ensure perceptual alignment. The paper collects 50k images, constructs the CBC, FTF, CSF, and HOC test sets, and evaluates 15 baselines spanning GAN, UNet-diffusion, and DiT-diffusion approaches. It finds that diffusion-based methods, especially FitDit, perform best on garment preservation and realism, and it highlights the limitations of traditional metrics such as FID/KID for assessing texture fidelity. These findings demonstrate the value of VTBench for diagnosing model capabilities, guiding architectural and dataset choices, and accelerating progress toward robust real-world VTON systems. The data, protocols, results, and annotations are to be released as open source.

Abstract

While virtual try-on has achieved significant progress, evaluating these models in real-world scenarios remains a challenge. A comprehensive benchmark is essential for three key reasons: (1) current metrics inadequately reflect human perception, particularly in unpaired try-on settings; (2) most existing test sets are limited to indoor scenarios and lack the complexity needed for real-world evaluation; and (3) an ideal system should guide future advancements in virtual try-on generation. To address these needs, we introduce VTBench, a hierarchical benchmark suite that systematically decomposes image virtual try-on into hierarchical, disentangled dimensions, each equipped with tailored test sets and evaluation criteria. VTBench exhibits three key advantages: (1) Multi-Dimensional Evaluation Framework: The benchmark encompasses five critical dimensions for virtual try-on generation (e.g., overall image quality, texture preservation, complex background consistency, cross-category size adaptability, and hand-occlusion handling). Granular evaluation metrics on the corresponding test sets pinpoint model capabilities and limitations across diverse, challenging scenarios. (2) Human Alignment: Human preference annotations are provided for each test set, ensuring the benchmark's alignment with perceptual quality across all evaluation dimensions. (3) Valuable Insights: Beyond standard indoor settings, we analyze model performance variations across dimensions and investigate the disparity between indoor and real-world try-on scenarios. To advance virtual try-on toward challenging real-world scenarios, VTBench will be open-sourced, including all test sets, evaluation protocols, generated results, and human annotations.

Paper Structure

This paper contains 14 sections, 8 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Overview of VTBench. We propose VTBench, the first comprehensive benchmark suite designed to evaluate image-based virtual try-on models. To enable fine-grained and objective assessment, we propose a comprehensive and hierarchical Evaluation Dimension Suite that systematically decomposes “image virtual try-on quality” into multiple well-defined dimensions. For each dimension and content category, we curate a dedicated test set and develop reliable metrics, and then sample virtual try-on images from 15 virtual try-on models based on different foundations to provide in-depth insights. We also conduct human preference annotation for virtual try-on results across all dimensions, demonstrating strong alignment between VTBench’s automated evaluations and human perceptual judgments. Our benchmark delivers multi-perspective insights, advancing the systematic assessment of virtual try-on technologies.
  • Figure 2: VTBench evaluation results for SOTA virtual try-on models, including GAN-based, UNet-based diffusion, and DiT-based diffusion models.
  • Figure 3: Validating VTBench’s human alignment. Our experimental results demonstrate that VTBench evaluations across all dimensions exhibit a strong alignment with human perceptual judgments. Each plot illustrates the verification results for a specific VTBench dimension, where a single dot represents the human preference win rate (x-axis) and the VTBench evaluation win rate (y-axis) for a given virtual try-on generation model. To assess the correlation, we perform a linear regression analysis and compute Spearman’s rank correlation coefficient for each dimension.
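
The alignment check described in the Figure 3 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the win-rate values are invented placeholders, and the helper names (`average_ranks`, `pearson`, `spearman`, `linear_fit`) are hypothetical. It assumes one (human win rate, benchmark win rate) pair per model within a single dimension, and computes Spearman's coefficient as the Pearson correlation of the ranks, with ties given their average rank.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def linear_fit(x: Sequence[float], y: Sequence[float]) -> tuple[float, float]:
    """Ordinary least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


# Hypothetical win rates for five models in one dimension (placeholder data).
human_win_rate = [0.20, 0.35, 0.50, 0.65, 0.80]   # x-axis in Figure 3
bench_win_rate = [0.25, 0.30, 0.55, 0.60, 0.85]   # y-axis in Figure 3

rho = spearman(human_win_rate, bench_win_rate)
slope, intercept = linear_fit(human_win_rate, bench_win_rate)
print(f"Spearman rho = {rho:.3f}, fit: y = {slope:.3f} * x + {intercept:.3f}")
```

A coefficient near 1 means the benchmark orders the models the same way human annotators do, which is the property the caption reports. The rank-based measure is insensitive to the scale of the win rates, so the regression line is a separate, complementary summary.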