Street TryOn: Learning In-the-Wild Virtual Try-On from Unpaired Person Images

Aiyu Cui, Jay Mahajan, Viraj Shah, Preeti Gomathinayagam, Chang Liu, Svetlana Lazebnik

TL;DR

This work tackles in-the-wild virtual try-on using unpaired training data by introducing the StreetTryOn benchmark and a DensePose-guided, diffusion-based pipeline. The method decomposes the problem into TryOn Parse Estimation, Warping Correction, and two inpainting steps, enabling garment transfer across diverse street poses and complex backgrounds without paired training data. Key contributions include a StyleGAN-based TryOn Parse Estimator, a DensePose-driven Warping Correction Module, and ControlNet-conditioned diffusion inpainting to remove old garments, reconstruct skin, and refine compositing. Results show strong performance on Street2Street and cross-domain tasks, illustrating robust generalization to real-world imagery and practical potential for consumer-oriented virtual try-on.
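
The DensePose-guided warping mentioned above can be illustrated with a generic UV-correspondence warp. This is a minimal sketch, not the authors' Warping Correction Module. It assumes each DensePose map is an H×W×3 float array holding (part index, u, v) with part 0 as background and u, v in [0, 1]. The function name `densepose_warp` and the parameters `n_parts` and `res` are hypothetical. Source pixels are scattered into a per-part texture atlas and then gathered at the target's coordinates; target pixels with no matching source texture stay empty, which is the kind of hole a later inpainting step would need to fill.

```python
import numpy as np

def densepose_warp(src_img, src_iuv, tgt_iuv, n_parts=24, res=32):
    """Move pixels from a source person to a target pose by matching
    DensePose (part, u, v) coordinates. Illustrative only."""
    # One res x res texture tile per body part (index 0 = background, unused).
    atlas = np.zeros((n_parts + 1, res, res, 3), dtype=src_img.dtype)
    filled = np.zeros((n_parts + 1, res, res), dtype=bool)

    def to_bins(iuv):
        # Quantize continuous (u, v) in [0, 1] into integer atlas cells.
        part = iuv[..., 0].astype(int)
        u = np.clip(np.round(iuv[..., 1] * (res - 1)).astype(int), 0, res - 1)
        v = np.clip(np.round(iuv[..., 2] * (res - 1)).astype(int), 0, res - 1)
        return part, u, v

    # Scatter: write each source body pixel into its atlas cell.
    sp, su, sv = to_bins(src_iuv)
    on_body = sp > 0
    atlas[sp[on_body], su[on_body], sv[on_body]] = src_img[on_body]
    filled[sp[on_body], su[on_body], sv[on_body]] = True

    # Gather: read the atlas at every target body pixel that has texture.
    tp, tu, tv = to_bins(tgt_iuv)
    valid = (tp > 0) & filled[tp, tu, tv]
    out = np.zeros(tgt_iuv.shape[:2] + (3,), dtype=src_img.dtype)
    out[valid] = atlas[tp[valid], tu[valid], tv[valid]]
    return out, valid  # ~valid on the body marks holes left for inpainting
```

In practice the source image would first be masked to the garment region so that only clothing pixels are transferred; that masking is omitted here for brevity.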

Abstract

Most virtual try-on research is motivated to serve the fashion business by generating images to demonstrate garments on studio models at a lower cost. However, virtual try-on should be a broader application that also allows customers to visualize garments on themselves using their own casual photos, known as in-the-wild try-on. Unfortunately, the existing methods, which achieve plausible results for studio try-on settings, perform poorly in the in-the-wild context. This is because these methods often require paired images (garment images paired with images of people wearing the same garment) for training. While such paired data is easy to collect from shopping websites for studio settings, it is difficult to obtain for in-the-wild scenes. In this work, we fill the gap by (1) introducing a StreetTryOn benchmark to support in-the-wild virtual try-on applications and (2) proposing a novel method to learn virtual try-on from a set of in-the-wild person images directly without requiring paired data. We tackle the unique challenges, including warping garments to more diverse human poses and rendering more complex backgrounds faithfully, by a novel DensePose warping correction method combined with diffusion-based conditional inpainting. Our experiments show competitive performance for standard studio try-on tasks and SOTA performance for street try-on and cross-domain try-on tasks.

Paper Structure (27 sections, 5 equations, 20 figures, 2 tables)

Figures (20)

  • Figure 1: Our proposed Street TryOn benchmark and method for in-the-wild try-on application, contrasted with existing work focusing on controlled studio images and paired training (see text).
  • Figure 2: Overview of our proposed virtual try-on method (see text for details).
  • Figure 3: Left: TryOn Parse Estimator. Right: Warping Correction Module.
  • Figure 4: Street2Street Try-On examples for our method.
  • Figure 5: (a)-top: Model2Street. (a)-bottom: Model2Model. (b)-top: Shop2Street. (b)-bottom: Shop2Model.
  • ...and 15 more figures