Image-Based Virtual Try-On: A Survey

Dan Song, Xuanpu Zhang, Juan Zhou, Weizhi Nie, Ruofeng Tong, Mohan Kankanhalli, An-An Liu

TL;DR

This survey defines image-based virtual try-on as conditional person image generation conditioned on a target clothing image, and provides a taxonomy spanning pipeline types, cloth-agnostic person representations, and three core modules: try-on indication, cloth warping, and try-on synthesis. It introduces a comprehensive, unified evaluation framework including CLIP-based semantic scoring, standard metrics (SSIM, FID, LPIPS), and a cross-dataset protocol, and benchmarks representative methods on VITON-HD, complemented by a user study with 139 participants. The authors analyze trends across TPS, STN, flow-based, and implicit transformation warping, highlighting diffusion-based generation as yielding state-of-the-art results in many settings while noting persistent challenges in parsing dependency, pose handling, and controllability. They also outline unresolved issues and future directions, such as parser-free representations, diffusion-based controllable generation, multi-modal data integration, and the development of specialized datasets and metrics to drive industry-ready virtual try-on solutions.

Abstract

Image-based virtual try-on aims to synthesize a naturally dressed person image from a clothing image, which revolutionizes online shopping and inspires related topics within image generation, showing both research significance and commercial potential. However, there is a gap between current research progress and commercial applications, and no comprehensive overview of the field exists to accelerate its development. In this survey, we provide a comprehensive analysis of state-of-the-art techniques and methodologies in terms of pipeline architecture, person representation, and key modules such as try-on indication, clothing warping, and the try-on stage. We additionally apply CLIP to assess the semantic alignment of try-on results, and evaluate representative methods with uniformly implemented evaluation metrics on the same dataset. In addition to quantitative and qualitative evaluation of current open-source methods, we highlight unresolved issues and outline future research directions to identify key trends and inspire further exploration. The uniformly implemented evaluation metrics, dataset, and collected methods will be made publicly available at https://github.com/little-misfit/Survey-Of-Virtual-Try-On.
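The abstract does not state how the CLIP-based semantic alignment score is computed. One common way to compare two CLIP embeddings is cosine similarity, and the sketch below illustrates that idea only. It assumes the embeddings of the try-on results and of the target garments have already been extracted with a CLIP image encoder; the function names `cosine_similarity` and `mean_semantic_score` are hypothetical and are not taken from the survey's code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_semantic_score(
    result_embeddings: Sequence[Sequence[float]],
    garment_embeddings: Sequence[Sequence[float]],
) -> float:
    """Average cosine similarity over paired (try-on result, garment) embeddings.

    Assumption: each try-on result is paired with the garment it should wear,
    and both were embedded by the same (hypothetical) CLIP image encoder.
    """
    if len(result_embeddings) != len(garment_embeddings):
        raise ValueError("need one garment embedding per try-on result")
    if not result_embeddings:
        raise ValueError("need at least one pair")
    scores = [
        cosine_similarity(r, g)
        for r, g in zip(result_embeddings, garment_embeddings)
    ]
    return sum(scores) / len(scores)
```

A higher average would indicate that generated images sit closer to their target garments in the embedding space. Whether the survey uses exactly this aggregation is an assumption here.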

Paper Structure (27 sections, 1 equation, 20 figures, 2 tables)

Figures (20)

  • Figure 1: A concise timeline of image-based virtual try-on milestones. Different colors indicate the main characteristic of each method. Please refer to Table \ref{big_table} for detailed comparisons.
  • Figure 2: Basic pipelines of image-based virtual try-on. Pipelines i and ii are both single-stage approaches: pipeline i uses a single generator to produce the try-on image directly, while pipeline ii aligns features in the feature domain before generating the try-on image. Pipelines iii and iv are both two-stage pipelines: the former uses the person representation as the bridge, while the latter uses the warped clothing. Pipelines v and vi are three-stage pipelines that differ in the order of Try-On Indication and Cloth Warping. Pipeline vii improves on v and vi by performing Try-On Indication and Cloth Warping simultaneously.
  • Figure 3: Three supplemental structures: (a) Teacher-Student network involving one set of in-shop clothes; (b) Teacher-Student network involving two sets of clothes; (c) Cycle-GAN structure with two sets of clothes.
  • Figure 4: Person representation types. Existing representations rely heavily on human parsers to indicate pose, shape, or semantic segmentation, and can be categorized into the following types: RGB image $\mathcal{P}_{1,2,3,4}$, pose $\mathcal{P}_{5}$, silhouette $\mathcal{P}_{6,7}$, Densepose $\mathcal{P}_{8,9}$, semantic segmentation $\mathcal{P}_{10,11,12}$, and landmark $\mathcal{P}_{13}$. * Subfigures of $\mathcal{P}_{13}$ are cited from DGP_yuduiqi_ESF.
  • Figure 5: Illustration of several spatial transformation approaches: (a) Thin Plate Spline (TPS), (b) Spatial Transformer Network (STN), and (c) Flow. (d) shows how often each approach is used by existing methods.
  • ...and 15 more figures