High-Fidelity Virtual Try-on with Large-Scale Unpaired Learning

Han Yang, Yanlong Zang, Ziwei Liu

TL;DR

Boosted Virtual Try-on (BVTON) generalizes and scales across varied dressing styles and data sources. Its unpaired try-on synthesizer is trained on pseudo pairs built from randomly misaligned on-model clothes, which lets it generate intricate skin texture and clothes boundaries.

Abstract

Virtual try-on (VTON) transfers a target clothing image to a reference person, where clothing fidelity is a key requirement for downstream e-commerce applications. However, existing VTON methods still fall short in high-fidelity try-on due to the conflict between the high diversity of dressing styles (e.g., clothes occluded by pants or distorted by posture) and the limited paired data for training. In this work, we propose a novel framework, Boosted Virtual Try-on (BVTON), to leverage large-scale unpaired learning for high-fidelity try-on. Our key insight is that pseudo try-on pairs can be reliably constructed from widely available fashion images. Specifically, 1) we first propose a compositional canonicalizing flow that maps on-model clothes into pseudo in-shop clothes, dubbed the canonical proxy. Each clothing part (sleeves, torso) is reversely deformed into an in-shop-like shape to compositionally construct the canonical proxy. 2) Next, we design a layered mask generation module that generates an accurate semantic layout by training on the canonical proxy. We replace the in-shop clothes used in conventional pipelines with the derived canonical proxy to boost the training process. 3) Finally, we propose an unpaired try-on synthesizer by constructing pseudo training pairs with randomly misaligned on-model clothes, where intricate skin texture and clothes boundaries can be generated. Extensive experiments on high-resolution ($1024\times768$) datasets demonstrate the superiority of our approach over state-of-the-art methods both qualitatively and quantitatively. Notably, BVTON shows great generalizability and scalability to various dressing styles and data sources.
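
The pseudo-pair idea in step 3 can be illustrated with a minimal sketch. The abstract only states that on-model clothes are "randomly misaligned"; the specific perturbation used here (a small random integer translation with zero fill), the helper names `shift_zero_fill` and `make_pseudo_pair`, and the `max_shift` parameter are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def shift_zero_fill(img, dy, dx):
    """Translate img by (dy, dx) pixels, filling uncovered pixels with zeros.

    Hypothetical helper; assumes |dy| < height and |dx| < width.
    """
    h, w = img.shape[:2]
    dy, dx = int(dy), int(dx)
    assert abs(dy) < h and abs(dx) < w
    out = np.zeros_like(img)
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out


def make_pseudo_pair(person, clothes_mask, max_shift=8, rng=None):
    """Build one pseudo training pair from a single on-model image.

    person:       H x W x 3 float array (the photo, which is also the target).
    clothes_mask: H x W bool array, True where the garment is.
    Returns (misaligned_clothes, agnostic, target).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Cut the garment out of the on-model photo.
    clothes = person * clothes_mask[..., None]
    # Assumed perturbation: a small random translation, so the garment no
    # longer lines up exactly with the body it was cut from.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    misaligned = shift_zero_fill(clothes, dy, dx)
    # Clothing-agnostic person: the garment region is blanked out.
    agnostic = person * (~clothes_mask)[..., None]
    return misaligned, agnostic, person
```

Under these assumptions, a synthesizer would take `misaligned` and `agnostic` as input and be supervised with `target`, so no separate in-shop photo of the garment is needed.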

Paper Structure

This paper contains 19 sections, 13 equations, 19 figures, 2 tables.

Figures (19)

  • Figure 1: Visual results showing the superiority of our high-fidelity try-on setting, boosted by large-scale unpaired learning. Our high-fidelity try-on pipeline, BVTON, preserves the full clothing details (clothing fidelity), including asymmetric clothing bottom shapes. Conventional virtual try-on methods inherently fail to preserve the complete traits of the target clothes. Here, "conventional" refers to the try-on setting used in previous methods, which directly preserves the bottom clothes of the reference person regardless of the target clothing shape. Our method also supports an additional application: model-to-model try-on.
  • Figure 2: The overall pipeline of BVTON, including the training and inference workflows. Network details are given in Fig. 3. CCM is first trained with paired data to predict the compositional canonicalizing flow for on-model clothes. We then extract the canonical proxies for the large-scale fashion images, and train the L-MGM with the proxies instead of the in-shop clothes. With predicted layered semantic masks, clothes can be warped accordingly in M-CDM. Finally, UTOM fuses the agnostics and the warped clothes to generate the try-on results.
  • Figure 3: The network design details of the modules used in BVTON. FPN denotes the feature pyramid network.
  • Figure 4: Visual comparison of four virtual try-on methods. The first row shows the conventional setting that directly preserves the bottom clothes, and the second row shows the high-fidelity try-on results. With the help of widely available fashion images, BVTON can generate realistic results with high clothing fidelity and remarkable skin details. In particular, BVTON generates realistic skin-clothes boundaries instead of simply overlaying the clothes onto the reference person. Visual artifacts are marked with red boxes.
  • Figure 5: Visual comparison of four virtual try-on methods on the VITON and TEST2 test sets. BVTON generalizes well to out-of-domain test data in both the conventional and the high-fidelity settings, which demonstrates the generalizability and scalability of our method. Visual artifacts are marked with red boxes.
  • ...and 14 more figures