CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model

Jianhao Zeng, Dan Song, Weizhi Nie, Hongshuo Tian, Tongtong Wang, Anan Liu

TL;DR

This work addresses the dual challenges of controllability and speed in diffusion-model-based virtual try-on. It introduces CAT-DM, which combines a Garment-Conditioned Diffusion Model (GC-DM) with a truncation-based acceleration strategy that seeds the reverse diffusion process from the output of a pre-trained GAN, enabling rapid, high-fidelity garment synthesis. GC-DM leverages ControlNet and enhanced garment feature extraction (via DINO-V2) to preserve garment patterns, and the generated result is merged with the original image through Poisson blending. On DressCode and VITON-HD, CAT-DM achieves state-of-the-art realism and garment detail with as few as two diffusion steps, offering substantial speedups over traditional diffusion methods and competitive performance relative to GAN-based baselines.

Abstract

Generative Adversarial Networks (GANs) dominate the research field in image-based virtual try-on, but have not resolved problems such as unnatural garment deformation and blurry generation quality. While the generative quality of diffusion models is impressive, achieving controllability poses a significant challenge when applying them to virtual try-on, and multiple denoising iterations limit their potential for real-time applications. In this paper, we propose Controllable Accelerated virtual Try-on with Diffusion Model (CAT-DM). To enhance the controllability, a basic diffusion-based virtual try-on network is designed, which utilizes ControlNet to introduce additional control conditions and improves the feature extraction of garment images. In terms of acceleration, CAT-DM initiates a reverse denoising process with an implicit distribution generated by a pre-trained GAN-based model. Compared with previous try-on methods based on diffusion models, CAT-DM not only retains the pattern and texture details of the in-shop garment but also reduces the sampling steps without compromising generation quality. Extensive experiments demonstrate the superiority of CAT-DM against both GAN-based and diffusion-based methods in producing more realistic images and accurately reproducing garment patterns.
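
The acceleration idea (start the reverse process from a noised GAN output instead of pure noise, as the Figure 3 (D) caption describes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear beta schedule, the total of 1000 steps, the truncation step of 100, and the helper names `make_alpha_bar`, `noise_to_step`, and `truncated_timesteps` are all hypothetical. Only the noise-addition stage and the shortened timestep list are shown; the denoising network itself is omitted.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_to_step(x_bar, t_trunc, alpha_bar, rng):
    """Forward-diffuse an initial image x_bar to step t_trunc (1-indexed):
    x_t = sqrt(alpha_bar_t) * x_bar + sqrt(1 - alpha_bar_t) * eps."""
    a = alpha_bar[t_trunc - 1]
    eps = rng.standard_normal(x_bar.shape)
    return np.sqrt(a) * x_bar + np.sqrt(1.0 - a) * eps

def truncated_timesteps(t_trunc, num_sampling_steps):
    """Evenly spaced, descending timesteps from t_trunc down to 1."""
    steps = np.linspace(t_trunc, 1, num_sampling_steps)
    return np.round(steps).astype(int).tolist()

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Stand-in for the try-on image produced by a pre-trained GAN (random here).
x_bar = rng.uniform(-1.0, 1.0, size=(3, 64, 48))

# Hypothetical truncation step; the reverse process would start from x_start.
x_start = noise_to_step(x_bar, t_trunc=100, alpha_bar=alpha_bar, rng=rng)

# Two sampling steps, matching the "as few as two diffusion steps" claim.
timesteps = truncated_timesteps(t_trunc=100, num_sampling_steps=2)
```

Because the starting sample already resembles a plausible try-on image, the sampler only has to traverse the low-noise end of the schedule, which is where the reduction in step count comes from.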

Paper Structure

This paper contains 14 sections, 5 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: CAT-DM not only enhances the controllability of the image generation process for virtual try-on but also effectively accelerates the sampling speed of the diffusion models. Top: Comparison results with other methods. CAT-DM accurately generates the pattern details on garments and produces images that are sufficiently clear. Bottom: CAT-DM requires fewer sampling steps than other diffusion models to generate clear and realistic virtual try-on images. Compared to the default 50 sampling steps of DCI-VTON, CAT-DM achieves a 25-fold acceleration.
  • Figure 2: The training pipeline of the GC-DM in our method. GC-DM comprises a fixed-parameter PBE and a trainable ControlNet. Given the noisy image $\mathbf{x}_t$, time step $t$, mask $m$, masked image $\mathbf{x}_0'$ and garment image $g$, ControlNet generates a set of control vectors $c_t$ by incorporating additional control conditions, such as densepose $p$. The control vectors are incorporated into the PBE to enhance the model's controllability while preserving the PBE's generative capabilities.
  • Figure 3: Illustration of different sampling methods in diffusion models. (A) Conventional DDPMs denoise gradually over a large number of time steps $T$. (B) DDIMs employ a class of non-Markovian diffusion processes to denoise gradually. Compared to DDPMs, DDIMs require fewer sampling steps, that is, $N\ll T$. (C) TDPM repurposes the parameters of the diffusion model to generate the implicit distribution at step $T_{\text{trunc}}$, using it as the initial sample for the reverse diffusion process. This approach accelerates sampling, since $T_{\text{trunc}}\ll T$. (D) CAT-DM utilizes a pre-trained GAN-based model to generate an initial try-on image $\bar{\mathbf{x}}$, to which noise is then added; the resulting noisy image $\mathbf{x}_{T_{\text{trunc}}}$ serves as the starting point of the reverse diffusion process.
  • Figure 4: Results from different types of generation methods. The directly generated try-on images exhibit noticeable distortion in the face region. The results obtained through image concatenation have incongruities at the junctions of the images. This issue is resolved in the results obtained using Poisson blending.
  • Figure 5: Comparative analysis of our method (CAT-DM) with other techniques on the VITON-HD dataset, focusing on the realism of results (better at the bottom left) and the number of trainable parameters (smaller is better). The unpaired setting is on the left, and the paired setting is on the right.
  • ...and 5 more figures
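
The contrast in the Figure 4 caption between image concatenation and Poisson blending can be illustrated with a small sketch. This is a generic, single-channel Poisson image-editing solver using Jacobi iterations, written for illustration only; it is not the authors' pipeline, and the function name `poisson_blend`, the iteration count, and the toy images are assumptions. The toy setup pastes a patch that differs from the background by a constant brightness offset: direct pasting leaves a visible step at the mask border, while the gradient-domain solve removes it.

```python
import numpy as np

def poisson_blend(src, dst, mask, iters=500):
    """Blend src into dst inside mask by matching the Laplacian of src
    while keeping dst fixed outside the mask (Dirichlet boundary)."""
    src = src.astype(float)
    dst = dst.astype(float)
    inner = mask.copy()
    # Keep the image border fixed so every solved pixel has four neighbours.
    inner[0, :] = inner[-1, :] = inner[:, 0] = inner[:, -1] = False
    # Discrete Laplacian of the source: 4*g_p minus the sum of its neighbours.
    lap = np.zeros_like(src)
    lap[1:-1, 1:-1] = (4.0 * src[1:-1, 1:-1] - src[:-2, 1:-1] - src[2:, 1:-1]
                       - src[1:-1, :-2] - src[1:-1, 2:])
    out = np.where(inner, src, dst)  # start from the naive paste
    for _ in range(iters):
        nb = np.zeros_like(out)
        nb[1:-1, 1:-1] = (out[:-2, 1:-1] + out[2:, 1:-1]
                          + out[1:-1, :-2] + out[1:-1, 2:])
        new = (nb + lap) / 4.0       # Jacobi update: f_p = (sum f_q + lap_p) / 4
        out[inner] = new[inner]
    return out

rng = np.random.default_rng(0)
dst = rng.random((16, 16))           # toy "original image"
src = dst + 0.5                      # toy "generated image" with a brightness offset
mask = np.zeros((16, 16), dtype=bool)
mask[4:12, 4:12] = True              # region to replace

naive = np.where(mask, src, dst)     # direct paste: 0.5 step at the mask border
blended = poisson_blend(src, dst, mask)

seam_naive = np.abs(naive - dst).max()
seam_blended = np.abs(blended - dst).max()
```

In this toy case the source and destination share the same gradients, so the blended result converges to the destination and the seam disappears, whereas the naive paste keeps the full 0.5 offset inside the mask.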