EfficientVITON: An Efficient Virtual Try-On Model using Optimized Diffusion Process

Mostafa Atef, Mariam Ayman, Ahmed Rashed, Ashrakat Saeed, Abdelrahman Saeed, Ahmed Fares

TL;DR

EfficientVITON addresses the challenge of realistic and efficient virtual try-on by integrating a pre-trained Stable Diffusion backbone with a spatial encoder and zero cross-attention blocks to learn latent garment–body alignment. It introduces non-uniform timestep sampling to speed up the diffusion process and employs a two-stage training regime with an attention-total-variation loss to refine alignment and details. The approach yields state-of-the-art results on VITON-HD in both quality (a low $L_{DM}$-style loss and favorable $L_{ATV}$-based regularization) and efficiency, with substantial reductions in training and inference time. This work has practical implications for real-time e-commerce applications, enabling high-fidelity try-on experiences with reduced computational demands, while also offering techniques that may generalize to other image-synthesis tasks.

Abstract

Wouldn't it be much more convenient for everybody to try on clothes just by looking into a mirror? The answer to that problem is virtual try-on, which enables users to digitally experiment with outfits. The core challenge lies in realistic image-to-image translation, where clothing must fit diverse human forms, poses, and figures. Early methods, which used 2D transformations, offered speed, but their image quality was often disappointing and lacked the nuance of deep learning. Though GAN-based techniques enhanced realism, their dependence on paired data proved limiting. More adaptable methods offered great visuals but demanded significant computing power and time. Recent advances in diffusion models have shown promise for high-fidelity translation, yet the current crop of virtual try-on tools still struggles with detail loss and warping issues. To tackle these challenges, this paper proposes EfficientVITON, a new virtual try-on system that leverages the pre-trained Stable Diffusion model for better images and deployment feasibility. The system includes a spatial encoder to preserve the clothing's finer details and zero cross-attention blocks to capture the subtleties of how clothes fit a human body. Input images are carefully prepared, and the diffusion process has been tweaked to significantly cut generation time without loss of image quality. The training process involves two distinct stages of fine-tuning, carefully balancing the loss functions to ensure both accurate try-on results and high-quality visuals. Rigorous testing on the VITON-HD dataset, supplemented with real-world examples, has demonstrated that EfficientVITON achieves state-of-the-art results.
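
The summary above mentions non-uniform timestep sampling without giving the schedule itself. As a rough illustration only, the sketch below picks a reduced set of diffusion timesteps with power-law spacing instead of even spacing. The function name `nonuniform_timesteps`, the `power` parameter, the choice of 1000 training steps, and the decision to place steps more densely near t = 0 are all assumptions made for this example; they are not taken from the paper and may differ from the schedule the authors actually use.

```python
def nonuniform_timesteps(num_train_steps=1000, num_inference_steps=20, power=2.0):
    """Return a descending list of integer timesteps in [0, num_train_steps - 1].

    Hypothetical helper: spacing follows (i / (n - 1)) ** power, so with
    power > 1 the selected steps cluster near t = 0 (low noise) and are
    sparse near t = T (high noise). power = 1.0 gives uniform spacing.
    """
    if not 1 <= num_inference_steps <= num_train_steps:
        raise ValueError("num_inference_steps must be in [1, num_train_steps]")

    last = num_train_steps - 1
    denom = max(num_inference_steps - 1, 1)  # avoid division by zero for n = 1

    steps = set()  # a set drops duplicates caused by rounding
    for i in range(num_inference_steps):
        frac = i / denom
        steps.add(round(last * frac ** power))

    # Samplers denoise from high noise to low noise, so sort descending.
    return sorted(steps, reverse=True)


if __name__ == "__main__":
    schedule = nonuniform_timesteps()
    print(schedule)
```

A sampler would then iterate over this list instead of an evenly spaced one. Whether a given scheduler accepts a custom timestep list depends on the library in use, so that wiring is left out here.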

Paper Structure (14 sections, 1 equation, 13 figures, 2 tables)

This paper contains 14 sections, 1 equation, 13 figures, 2 tables.

Figures (13)

  • Figure 1: OpenPose Output.
  • Figure 2: LIP Parsing Output.
  • Figure 3: Agnostic Image Output.
  • Figure 4: Agnostic Mask Output.
  • Figure 5: Parse Agnostic Image Output.
  • ...and 8 more figures