Significance of Skeleton-based Features in Virtual Try-On

Debapriya Roy, Sanchayan Santra, Diganta Mukherjee, Bhabatosh Chanda

TL;DR

This work tackles virtual try-on under large arm bending with a part-based warping strategy that treats sleeves and torso separately and uses geometry-guided features anchored to body structure. The pipeline combines a mask-prediction module, which identifies the target garment regions, with an image synthesizer network (ISN), which inpaints and blends the warped clothing onto the user image; training is self-supervised via the objective $P' = F(M,P)$. Key contributions include (i) a sleeve-specific warp using bone-level correspondences and elbow-bending constraints, (ii) a landmark-/TPS-guided torso warp, (iii) a mask-prediction module that uses parsing to handle occlusions, and (iv) an ISN for seamless synthesis. Results on the MPV dataset show competitive or superior Fréchet Inception Distance and Structural Similarity scores compared with multiple baselines, indicating robust performance across poses and occlusions, with practical implications for realistic online fashion experiences.

Abstract

The idea of \textit{Virtual Try-On} (VTON) benefits e-retailing by giving a user the convenience of trying on clothing in the comfort of their home. In general, most existing VTON methods produce inconsistent results when a person posing with their arms folded, i.e., bent or crossed, wants to try on an outfit. The problem becomes severe for long-sleeved outfits, because in crossed-arm postures different clothing parts may overlap. The existing approaches, especially the warping-based methods employing the \textit{Thin Plate Spline (TPS)} transform, cannot tackle such cases. To this end, we attempt a solution in which the clothing from the source person is segmented into semantically meaningful parts and each part is warped independently to the shape of the person. To address the bending issue, we employ hand-crafted geometric features consistent with human body geometry for warping the source outfit. In addition, we propose two learning-based modules: a synthesizer network and a mask prediction network. Together, these attempt to produce a photo-realistic, pose-robust VTON solution without requiring any paired training data. Comparison with some of the benchmark methods clearly establishes the effectiveness of the approach.

Paper Structure (12 sections, 7 equations, 9 figures, 3 tables, 1 algorithm)

This paper contains 12 sections, 7 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Results of different methods for different input clothing, model, and person combinations. A visual comparison with some benchmark methods illustrates the efficacy of the proposed method. See
  • Figure 2: Demonstration of simple and complex human poses.
  • Figure 3: Illustration of our overall virtual try-on approach.
  • Figure 4: (a) Illustration of the method for predicting the target warp, (b) Block diagram of the proposed mask predictor network (MPN), (c) Demonstration of human landmarks, (d) Demonstration of human part parsing [cdcl] results and pose key points. Best viewed in the electronic version.
  • Figure 5: (a) (top) Elbow flexion and extension (picture courtesy elbow_flexion), (bottom) stretch and folds in a clothing sleeve due to elbow flexion, i.e., arm bending (picture courtesy sleeve_folding). (b) A graphical illustration of the arm-bending phenomenon in humans. A sample pair of model and person relevant to the illustrated scenario is given above for better understanding. (c) Plot of functions $f(\phi_1, \phi_2)$, $g(\phi)$ and $h(\phi_1, \phi_2)$. (d) Geometrical illustration of our sleeve-warping method for the two different scenarios. (Left) Example of a case when assumption 1 holds. (Right) Example of a case when assumption 2 holds. Here $\{A, B, C\}$ and $\{A', B', C'\}$ are the landmarks corresponding to the arm of the person and the model respectively. $X$ refers to a point belonging to the sleeve segment of the target warp and $X'$ is its corresponding source pixel.
  • ...and 4 more figures
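
The Figure 5 caption describes arm bending in terms of three arm landmarks $\{A, B, C\}$. As a rough illustration of the kind of skeleton-based quantity involved, the sketch below computes the interior angle at the middle landmark from 2D coordinates. It assumes that $A$, $B$, $C$ are the shoulder, elbow, and wrist in that order, which the caption does not state, and the function name `elbow_angle` is hypothetical. It is not the paper's definition of $f$, $g$, or $h$.

```python
import math


def elbow_angle(a, b, c):
    """Interior angle at landmark b, in degrees, between segments b->a and b->c.

    a, b, c are (x, y) pixel coordinates. Treating them as shoulder, elbow,
    and wrist is an assumption made for illustration only.
    """
    v1 = (a[0] - b[0], a[1] - b[1])  # elbow -> shoulder
    v2 = (c[0] - b[0], c[1] - b[1])  # elbow -> wrist
    n1 = math.hypot(v1[0], v1[1])
    n2 = math.hypot(v2[0], v2[1])
    if n1 == 0 or n2 == 0:
        raise ValueError("landmarks must be distinct points")
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against floating-point drift just outside [-1, 1].
    cos_t = max(-1.0, min(1.0, cos_t))
    return math.degrees(math.acos(cos_t))


straight = elbow_angle((0, 0), (1, 0), (2, 0))   # fully extended arm
bent = elbow_angle((0, 0), (1, 0), (1, 1))       # right-angle bend
```

Under this convention a fully extended arm gives an angle near 180 degrees and a tighter bend gives a smaller angle; comparing the angle for the person's landmarks with that for the model's landmarks $\{A', B', C'\}$ would be one way to quantify how differently the two arms are bent.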