OutfitAnyone: Ultra-high Quality Virtual Try-On for Any Clothing and Any Person

Ke Sun; Jian Cao; Qi Wang; Linrui Tian; Xindi Zhang; Lian Zhuo; Bang Zhang; Liefeng Bo; Wenbo Zhou; Weiming Zhang; Daiheng Gao

TL;DR

OutfitAnyone targets ultra-high-fidelity virtual try-on under diverse poses, body shapes, and backgrounds using a two-stream (dual-path) conditional diffusion framework. The method combines ReferenceNet-based garment feature injection, a tailored classifier-free guidance strategy, and a background and lighting retention approach, supplemented by a Pose and Shape Guider and a post-hoc Refiner, to achieve photorealistic results. Key contributions include support for single- and multi-piece outfits, preservation of garment texture across varied subjects (including anime characters and selfies), and design-assisted workflows, with strong reported performance and robustness across real-world scenarios. The work demonstrates practical value for online shopping, content creation, and design, and reports strong community engagement: over 5,000 GitHub stars and a top-20 ranking among Hugging Face Spaces.

Abstract

Virtual Try-On (VTON) has become a transformative technology, empowering users to experiment with fashion without ever having to physically try on clothing. However, existing methods often struggle to generate high-fidelity and detail-consistent results. While diffusion models, such as the Stable Diffusion series, have shown their capability in creating high-quality and photorealistic images, they encounter formidable challenges in conditional generation scenarios like VTON. Specifically, these models struggle to maintain a balance between control and consistency when generating images for virtual clothing trials. OutfitAnyone addresses these limitations by leveraging a two-stream conditional diffusion model, enabling it to adeptly handle garment deformation for more lifelike results. It distinguishes itself with scalability, modulating factors such as pose and body shape, and with broad applicability, extending from anime to in-the-wild images. OutfitAnyone's performance in diverse scenarios underscores its utility and readiness for real-world deployment. For more details and animated results, please see https://humanaigc.github.io/outfit-anyone/.

Paper Structure (17 sections, 13 figures)

This paper contains 17 sections, 13 figures.

Introduction
Related Works
Overall Framework
Clothing Feature Injection
Classifier-Free Guidance
Background and Lighting Retention
Pose and Shape Guider
Detail Refiner
Results
Any Outfits
Any Person
Any Body Shape
Any Background
Refiner
Comparison
...and 2 more sections

Figures (13)

Figure 1: We introduce OutfitAnyone, a diffusion-based framework for 2D Virtual Try-On. To date, it has garnered over 5,000 stars on GitHub and ranked within the top 20 among all Hugging Face Spaces.
Figure 2: Method overview: OutfitAnyone processes an input consisting of a model, a garment, and related prompts through a dual-path conditional diffusion model. This model splits into two pathways, each handling the model and garment data independently. The two streams then merge within a fusion network, which integrates the garment details into the model's feature representation. Specifically, we extract two features from the model, openpose (which can be replaced by densepose or SMPL) and initmask, and concatenate them with the model image. This composite input is then fed into our Dual-Path SD model, which ensures both high-quality retention and restoration of the garment's features. The feature spaces for models and garments are aligned, which significantly accelerates convergence (a visible try-on effect is achievable within just 6k iterations). The prompt, although it may not align perfectly with the spatial pixels, plays a crucial role in preserving semantic-level information.
Figure 3: Refiner takes the coarse output from the dual-path conditional diffusion model as its starting point and further enhances it through our subsequent refinement process.
Figure 4: Virtual Try-On with different Outfits.
Figure 5: Virtual Try-On for kids.
...and 8 more figures
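
The input assembly described in the Figure 2 caption (pose and mask features concatenated with the model image) can be illustrated with a small sketch. This is not the authors' code: the helper name `concat_channels`, the channel counts (3 for the image, 3 for a rendered pose map, 1 for the mask), and the use of plain Python lists in place of image tensors are all assumptions made for illustration. The caption also does not say whether the concatenation happens in pixel space or latent space, so the sketch only shows the channel-wise stacking itself.

```python
def concat_channels(*maps):
    """Concatenate the per-pixel channel lists of equally sized H x W maps.

    Hypothetical helper: each map is a nested list indexed as map[y][x],
    where every entry is a list of channel values for that pixel.
    """
    height, width = len(maps[0]), len(maps[0][0])
    for m in maps:
        if len(m) != height or any(len(row) != width for row in m):
            raise ValueError("all maps must share the same height and width")
    # For every pixel, append the channels of each map in the order given.
    return [
        [[c for m in maps for c in m[y][x]] for x in range(width)]
        for y in range(height)
    ]


# Toy 2 x 2 inputs; the channel counts below are assumptions, not from the paper.
H, W = 2, 2
model_image = [[[0.5, 0.5, 0.5] for _ in range(W)] for _ in range(H)]  # RGB
pose_map = [[[0.0, 1.0, 0.0] for _ in range(W)] for _ in range(H)]     # rendered pose
init_mask = [[[1.0] for _ in range(W)] for _ in range(H)]              # try-on region

conditioned = concat_channels(model_image, pose_map, init_mask)
print(len(conditioned), len(conditioned[0]), len(conditioned[0][0]))  # 2 2 7
```

In a real pipeline the same stacking would normally be one tensor concatenation along the channel axis; the point here is only that the spatial size is unchanged while the channel count grows to the sum of the inputs.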
