PROMO: Promptable Outfitting for Efficient High-Fidelity Virtual Try-On

Haohua Chen, Tianze Zhou, Wei Zhu, Runqi Wang, Yandong Guan, Dejia Song, Yibo Chen, Xu Tang, Yao Hu, Lu Sheng, Zhiyong Wu

Abstract

Virtual Try-on (VTON) has become a core capability for online retail, where realistic try-on results provide reliable fit guidance, reduce returns, and benefit both consumers and merchants. Diffusion-based VTON methods achieve photorealistic synthesis, yet often rely on intricate architectures such as auxiliary reference networks and suffer from slow sampling, making the trade-off between fidelity and efficiency a persistent challenge. We approach VTON as a structured image editing problem that demands strong conditional generation under three key requirements: subject preservation, faithful texture transfer, and seamless harmonization. Under this perspective, our training framework is generic and transfers to broader image editing tasks. Moreover, the paired data produced by VTON constitutes a rich supervisory resource for training general-purpose editors. We present PROMO, a promptable virtual try-on framework built upon a Flow Matching DiT backbone with latent multi-modal conditional concatenation. By leveraging conditioning efficiency and self-reference mechanisms, our approach substantially reduces inference overhead. On standard benchmarks, PROMO surpasses both prior VTON methods and general image editing models in visual fidelity while delivering a competitive balance between quality and speed. These results demonstrate that flow-matching transformers, coupled with latent multi-modal conditioning and self-reference acceleration, offer an effective and training-efficient solution for high-quality virtual try-on.

Paper Structure (28 sections, 13 equations, 12 figures, 5 tables)


Figures (12)

  • Figure 1: PROMO enables multi-garment try-on and prompt-based control over dressing styles, and it demonstrates robust performance in challenging real-world scenarios.
  • Figure 2: Overview of scenarios handled by our PROMO framework. The left panel illustrates three common scenarios that customers encounter during online shopping: 1) model image available without a specific garment image, 2) garment image available without a model, and 3) both model and garment images available. Our system addresses the missing information in each scenario by generating the necessary conditional inputs required by our model. Notably, for scenarios without model images, our system features pure image-reference capability, allowing text prompts to be omitted while maintaining a robust baseline performance.
  • Figure 3: Compared to directly using the original DensePose [Güler et al., 2018], our method better estimates plausible body shapes under loose clothing, effectively preventing information leakage.
  • Figure 4: We downsample the extracted human parsing to match the resolution of the latent space. Compared to the standard supervision in (a), we adopt a weighted loss design in (b).
  • Figure 5: Our method for efficient spatial condition injection. We paste the pose condition directly onto the agnostic image and then downsample the result, reducing the 2N condition tokens to N/4 tokens, an 87.5% token reduction.
  • ...and 7 more figures
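
The token arithmetic in the Figure 5 caption can be checked with a short sketch. It assumes that the pose condition and the agnostic image each occupy N tokens before merging (2N in total), and that the downsampling divides the token count by 4 (for example, 2x along each spatial axis). The caption states only the 2N and N/4 totals, so the function name, parameters, and the 2x-per-axis reading are illustrative assumptions rather than the authors' implementation.

```python
from fractions import Fraction


def token_reduction(n_tokens: int, num_conditions: int = 2, area_factor: int = 4) -> Fraction:
    """Fraction of condition tokens removed by paste-then-downsample.

    n_tokens       -- tokens per condition image (N in the caption)
    num_conditions -- separate condition images before pasting (assumed 2:
                      the pose map and the agnostic image)
    area_factor    -- token-count divisor from downsampling (assumed 4)
    """
    before = num_conditions * n_tokens        # 2N tokens if kept separate
    after = Fraction(n_tokens, area_factor)   # N/4 tokens after paste + downsample
    return 1 - after / before


# Example with N = 1024: 2048 tokens shrink to 256.
print(float(token_reduction(1024)))  # 0.875
```

The ratio does not depend on N: pasting halves the token count (2N to N), and the downsampling step removes three quarters of what remains (N to N/4), which gives 1 - 1/8 = 87.5%.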