Street TryOn: Learning In-the-Wild Virtual Try-On from Unpaired Person Images
Aiyu Cui, Jay Mahajan, Viraj Shah, Preeti Gomathinayagam, Chang Liu, Svetlana Lazebnik
TL;DR
This work tackles in-the-wild virtual try-on from unpaired training data by introducing the StreetTryOn benchmark and a DensePose-guided, diffusion-based pipeline. The method decomposes the problem into TryOn Parse Estimation, Warping Correction, and two inpainting steps, which lets it transfer garments across diverse street poses and complex backgrounds without paired training data. Key contributions include a StyleGAN-based TryOn Parse Estimator, a DensePose-driven Warping Correction Module, and ControlNet-conditioned diffusion inpainting that removes old garments, reconstructs skin, and refines compositing. Experiments report state-of-the-art results on Street2Street and cross-domain try-on tasks and competitive results on standard studio try-on, suggesting that the approach generalizes to real-world imagery and is practical for consumer-oriented virtual try-on.
Abstract
Most virtual try-on research aims to serve the fashion business by generating images that demonstrate garments on studio models at a lower cost. However, virtual try-on should be a broader application that also allows customers to visualize garments on themselves using their own casual photos, a setting known as in-the-wild try-on. Unfortunately, existing methods, which achieve plausible results in studio try-on settings, perform poorly in the in-the-wild context. This is because these methods often require paired images (garment images paired with images of people wearing the same garment) for training. While such paired data is easy to collect from shopping websites for studio settings, it is difficult to obtain for in-the-wild scenes. In this work, we fill the gap by (1) introducing the StreetTryOn benchmark to support in-the-wild virtual try-on applications and (2) proposing a novel method that learns virtual try-on directly from a set of in-the-wild person images without requiring paired data. We tackle the unique challenges of this setting, including warping garments to more diverse human poses and faithfully rendering more complex backgrounds, with a novel DensePose warping correction method combined with diffusion-based conditional inpainting. Our experiments show competitive performance on standard studio try-on tasks and state-of-the-art (SOTA) performance on street try-on and cross-domain try-on tasks.
