Clothing agnostic Pre-inpainting Virtual Try-ON
Sehyun Kim, Hye Jun Lee, Jiwoo Lee, Taemin Lee
TL;DR
This paper addresses two limitations of diffusion-based virtual try-on, namely inaccurate detection of bottoms and persistence of the input clothing silhouette, by introducing CaP-VTON. The approach combines a DressCode-based multi-category masking scheme with a Stable Diffusion skin inpainting pipeline and a Generate Skin module that removes existing clothing and restores exposed skin, producing clothing-agnostic, full-body synthesis. Empirical results show that CaP-VTON achieves 92.5% short-sleeve silhouette accuracy (a 15.4-point gain over Leffa) and maintains high visual quality as measured by FID, SSIM, and LPIPS, while enabling cross-category clothing replacement. The method has practical value for e-commerce, avatar creation, and personalized styling because it delivers stable, high-fidelity virtual try-on across varied clothing types and poses, and it remains model-agnostic for integration with diffusion-based systems.
Abstract
With advances in deep learning, virtual try-on technology has gained significant application value in e-commerce, fashion, and entertainment. The recently proposed Leffa addresses the texture distortion problem of diffusion-based models, but it still has two limitations: bottoms are detected inaccurately, and the silhouette of the existing clothing persists in the synthesis results. To address these problems, this study proposes CaP-VTON (Clothing Agnostic Pre-Inpainting Virtual Try-On). CaP-VTON integrates DressCode-based multi-category masking with Stable Diffusion-based skin inpainting preprocessing. In particular, a Generate Skin module is introduced to solve the skin restoration problems that occur when long-sleeved images are converted to short-sleeved or sleeveless ones. The resulting preprocessing structure improves the naturalness and consistency of full-body clothing synthesis and enables high-quality restoration that takes human posture and color into account. As a result, CaP-VTON achieved 92.5% short-sleeve synthesis accuracy, 15.4 percentage points higher than Leffa, and consistently reproduced the style and shape of the reference clothing in visual evaluation. The proposed structure remains model-agnostic and is applicable to various diffusion-based virtual try-on systems; it can also contribute to applications that require high-precision virtual try-on, such as e-commerce, custom styling, and avatar creation.
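The preprocessing idea described above (mask the garment category being replaced, then fill the masked region with skin before try-on) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the label ids, the `CATEGORY_LABELS` table, and the function names are hypothetical, and the "inpainting" step is a placeholder that only relabels pixels, not the Stable Diffusion pipeline or the Generate Skin module used in the paper.

```python
# Hypothetical label ids for a human-parsing map. The actual DressCode label
# set and the paper's masking rules are not specified in this section.
BACKGROUND, SKIN, UPPER, LOWER, DRESS = 0, 1, 2, 3, 4

# Assumed mapping from the garment category being replaced to parsing labels.
CATEGORY_LABELS = {
    "upper": {UPPER},
    "lower": {LOWER},
    "dress": {UPPER, LOWER, DRESS},
}


def garment_mask(parsing, category):
    """Return a binary mask (1 = pixel to erase) for the given category."""
    labels = CATEGORY_LABELS[category]
    return [[1 if px in labels else 0 for px in row] for row in parsing]


def pre_inpaint(parsing, category, skin_label=SKIN):
    """Placeholder for skin inpainting: relabel masked pixels as skin.

    A real system would run a diffusion inpainting model on the RGB image
    here; this stand-in only shows where that step sits in the data flow.
    """
    mask = garment_mask(parsing, category)
    return [
        [skin_label if m else px for px, m in zip(row, mask_row)]
        for row, mask_row in zip(parsing, mask)
    ]


# Toy 3x4 parsing map: a long-sleeved top (label 2) above bottoms (label 3).
parsing = [
    [0, 2, 2, 0],
    [2, 2, 2, 2],  # sleeves reach the edges of this row
    [0, 3, 3, 0],
]

# Clothing-agnostic map: the old top's silhouette is gone, bottoms untouched.
agnostic = pre_inpaint(parsing, "upper")
```

In this sketch the old sleeve pixels become skin before any new garment is synthesized, which is the intuition behind removing the input silhouette when converting a long-sleeved image to a short-sleeved one.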
