Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback

Yuki Hirakawa, Takashi Wada, Ryotaro Shimizu, Takuya Furusawa, Yuki Saito, Ryosuke Araki, Tianwei Chen, Fan Mo, Yoshimitsu Aoki

Abstract

Given a person image and a garment image, image-based Virtual Try-ON (VTON) synthesizes a try-on image of the person wearing the target garment. As VTON systems become increasingly important in practical applications such as fashion e-commerce, reliable evaluation of their outputs has emerged as a critical challenge. In real-world scenarios, ground-truth images of the same person wearing the target garment are typically unavailable, making reference-based evaluation impractical. Moreover, widely used distribution-level metrics such as Fréchet Inception Distance and Kernel Inception Distance measure dataset-level similarity and fail to reflect the perceptual quality of individual generated images. To address these limitations, we propose Image Quality Assessment for Virtual Try-On (VTON-IQA), a reference-free framework for human-aligned, image-level quality assessment that requires no ground-truth images. To model human perceptual judgments, we construct VTON-QBench, a large-scale human-annotated benchmark comprising 62,688 try-on images generated by 14 representative VTON models and 431,800 quality annotations collected from 13,838 qualified annotators. To the best of our knowledge, this is the largest dataset to date for human subjective evaluation in virtual try-on. Evaluating virtual try-on quality requires verifying both garment fidelity and the preservation of person-specific details, so the try-on image must be compared against both the garment image and the person image. To explicitly model these cross-image interactions, we introduce an Interleaved Cross-Attention module that extends standard transformer blocks by inserting a cross-attention layer between the self-attention and MLP layers in the latter blocks. Extensive experiments show that VTON-IQA achieves reliable human-aligned image-level quality prediction. Moreover, we conduct a comprehensive benchmark evaluation of 14 representative VTON models using VTON-IQA.
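
The Interleaved Cross-Attention idea described in the abstract (a cross-attention layer placed between self-attention and the MLP of a transformer block) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The pre-norm residual layout, the single attention head, the ReLU MLP, the random weights, and the choice of letting try-on tokens attend to a generic set of context tokens are all assumptions made for illustration, and the names `ica_block`, `attention`, and `random_matrix` are hypothetical.

```python
import math
import random


def matmul(a, b):
    # Plain matrix product of two row-major nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    # Element-wise sum, used for residual connections.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def layer_norm(x, eps=1e-6):
    # Normalize each token vector to zero mean and unit variance (no learned scale).
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, context, w_q, w_k, w_v):
    # Single-head scaled dot-product attention: queries attend to context.
    q, k, v = matmul(queries, w_q), matmul(context, w_k), matmul(context, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, list(zip(*k)))  # q @ k^T
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def mlp(x, w1, w2):
    # Two-layer MLP with ReLU (assumed; the source does not name the activation).
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def ica_block(x, context, params):
    # Assumed pre-norm layout: self-attention -> cross-attention -> MLP.
    h = layer_norm(x)
    x = add(x, attention(h, h, *params["self"]))
    # The inserted layer: tokens of one image attend to tokens of another image.
    x = add(x, attention(layer_norm(x), layer_norm(context), *params["cross"]))
    return add(x, mlp(layer_norm(x), *params["mlp"]))


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d = 4  # toy embedding width
params = {
    "self": [random_matrix(d, d, rng) for _ in range(3)],
    "cross": [random_matrix(d, d, rng) for _ in range(3)],
    "mlp": [random_matrix(d, 2 * d, rng), random_matrix(2 * d, d, rng)],
}
tryon_tokens = random_matrix(3, d, rng)    # stand-in for try-on image tokens
context_tokens = random_matrix(5, d, rng)  # stand-in for garment or person tokens
out = ica_block(tryon_tokens, context_tokens, params)
print(len(out), len(out[0]))
```

The output keeps the shape of the query tokens, so a block of this form can replace a standard transformer block without changing the token count of the branch it belongs to.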

Paper Structure

This paper contains 26 sections, 13 equations, 18 figures, 12 tables.

Figures (18)

  • Figure 1: Overview of the VTON-QBench construction pipeline. VTON-QBench is built through five stages: (1) synthetic garment–person pair augmentation via FLUX.1-dev, (2) pseudo-triplet construction, (3) virtual try-on image generation using 14 representative VTON models, (4) crowdsourced human annotation with reference images, and (5) dataset curation to remove unreliable annotations. This pipeline ensures fashion diversity, controlled evaluation settings, and reliable human-aligned quality labels.
  • Figure 2: Synthetic garment–person pairs.
  • Figure 3: Distribution of Krippendorff’s $\alpha$.
  • Figure 5: Architecture of VTON-IQA. The network processes $I_G$, $I_P$, and $I_V$ through a three-branch transformer backbone. The first half of the layers performs independent feature extraction, while the latter half incorporates Interleaved Cross-Attention (ICA) to explicitly model cross-image interactions. The scoring module aggregates the [CLS] representations to predict a human-aligned image-level quality score.
  • Figure 6: Qualitative results. From left to right: garment image, target person image, generated try-on results (columns 3–7), and ground-truth image. The top-right value shows the human score, and the top-left black box indicates each metric’s ranking.
  • ...and 13 more figures
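
Figure 3 reports Krippendorff's $\alpha$, a chance-corrected measure of inter-annotator agreement defined as $\alpha = 1 - D_o / D_e$, where $D_o$ is the observed disagreement within items and $D_e$ is the disagreement expected by chance. The sketch below computes it for numeric ratings with the interval distance $(a - b)^2$. The choice of the interval metric and the function name `krippendorff_alpha_interval` are assumptions for illustration; the listing above does not state which variant or threshold the authors use during dataset curation.

```python
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval metric.

    `units` is a list of lists; each inner list holds the numeric ratings
    that one item (for example, one try-on image) received.
    """
    # Items with fewer than two ratings carry no agreement information.
    pairable = [u for u in units if len(u) >= 2]
    pooled = [v for u in pairable for v in u]
    n = len(pooled)
    if n < 2:
        raise ValueError("need at least one item with two or more ratings")

    # Observed disagreement: ordered rating pairs within each item,
    # weighted by 1 / (m_u - 1), averaged over all n pairable ratings.
    observed = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in pairable
    ) / n

    # Expected disagreement: ordered pairs over the pooled ratings.
    expected = sum((a - b) ** 2 for a, b in permutations(pooled, 2)) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")

    return 1.0 - observed / expected


# Perfect within-item agreement yields alpha = 1.
print(krippendorff_alpha_interval([[1, 1], [2, 2], [3, 3]]))  # 1.0
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement. The pooled-pairs loop is quadratic in the number of ratings, so a coincidence-matrix formulation would be preferable at the scale of hundreds of thousands of annotations.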