DH-VTON: Deep Text-Driven Virtual Try-On via Hybrid Attention Learning

Jiabao Wei, Zhiyuan Ma

TL;DR

DH-VTON is a deep text-driven virtual try-on model featuring a special hybrid attention learning strategy and a deep garment semantic preservation module. It outperforms previous diffusion-based and GAN-based approaches, showing competitive performance in preserving garment details and generating authentic human images.

Abstract

Virtual Try-ON (VTON) aims to synthesize images of a specific person dressed in given garments, and has recently received considerable attention in online shopping scenarios. Currently, the core challenges of the VTON task lie mainly in fine-grained semantic extraction (i.e., deep semantics) of the given reference garments during depth estimation and in effective texture preservation when the garments are synthesized and warped onto the human body. To cope with these issues, we propose DH-VTON, a deep text-driven virtual try-on model featuring a special hybrid attention learning strategy and a deep garment semantic preservation module. We build our DH-VTON pipeline on a well-built pre-trained paint-by-example (abbr. PBE) approach. Specifically, to extract the deep semantics of the garments, we first introduce InternViT-6B as a fine-grained feature learner, which can be trained to align its large-scale intrinsic knowledge with deep text semantics (e.g., "neckline" or "girdle") to make up for the deficiency of the commonly adopted CLIP encoder. Based on this, to enhance the customized dressing abilities, we further introduce a Garment-Feature ControlNet Plus (abbr. GFC+) module and propose a new hybrid attention strategy for training, which adaptively integrates fine-grained characteristics of the garments into different layers of the VTON model to achieve multi-scale feature preservation. Extensive experiments on several representative datasets demonstrate that our method outperforms previous diffusion-based and GAN-based approaches, showing competitive performance in preserving garment details and generating authentic human images.

Paper Structure

This paper contains 17 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Demonstration of dynamic high-resolution capabilities. InternViT-6B-448px-V1-5 dynamically matches an optimal aspect ratio from pre-defined ratios, divides the image into tiles of $448\times448$ pixels, and creates a thumbnail for global context.
  • Figure 2: Overview of DH-VTON. We demonstrate the training pipeline of our DH-VTON and details of the attention block. (Left) DH-VTON comprises a fixed-parameter PBE and a trainable GFC+. In addition to the given noisy image $\mathbf{x}_t$, mask $m$, masked image $\mathbf{x}_0'$, garment image $g$, and time step $t$, GFC+ incorporates additional control conditions, such as pose $p$ and densepose $d$, to generate a set of control vectors $c_t$. The control vectors are integrated into PBE to enhance the model's controllability while preserving PBE's generation capabilities. (Right) We introduce a hybrid attention strategy in GFC+ to ensemble fine-grained characteristics from different layers for multi-scale feature preservation.
  • Figure 3: Qualitative results on VITON-HD and DressCode test datasets. Please zoom in for more details.
  • Figure 4: Effect of $\lambda$. We compare the results of DH-VTON trained without/with hybrid attention strategy and using different values of $\lambda$.
  • Figure 5: Effect of InternViT-6B. We compare the results of DH-VTON when using different feature extractors.
  • ...and 1 more figures
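
The dynamic high-resolution step described in the Figure 1 caption (match an aspect ratio from pre-defined ratios, cut $448\times448$ tiles, add a thumbnail) can be sketched roughly as follows. This is a minimal illustration, not the authors' or InternViT's actual preprocessing code: the function names, the `max_tiles=12` budget, and the tie-breaking rule (prefer fewer tiles) are all assumptions made for the example. Only the tile size of 448 pixels comes from the caption.

```python
from typing import Dict, List, Tuple

TILE = 448  # tile side length in pixels, as stated in the Figure 1 caption

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def candidate_grids(max_tiles: int) -> List[Tuple[int, int]]:
    """Enumerate the pre-defined (cols, rows) grids using at most max_tiles tiles."""
    return [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]


def closest_grid(width: int, height: int, max_tiles: int = 12) -> Tuple[int, int]:
    """Pick the grid whose aspect ratio is closest to the image's aspect ratio.

    Ties are broken in favour of the grid with fewer tiles (an assumption
    made for this sketch).
    """
    target = width / height
    return min(
        candidate_grids(max_tiles),
        key=lambda g: (abs(target - g[0] / g[1]), g[0] * g[1]),
    )


def tile_boxes(cols: int, rows: int, tile: int = TILE) -> List[Box]:
    """Crop boxes, in row-major order, on the image resized to the grid."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


def plan_tiles(width: int, height: int, max_tiles: int = 12) -> Dict[str, object]:
    """Return the resize target, the tile crop boxes, and the thumbnail size."""
    cols, rows = closest_grid(width, height, max_tiles)
    return {
        "resize_to": (cols * TILE, rows * TILE),  # image is resized to fit the grid
        "boxes": tile_boxes(cols, rows),          # one 448x448 crop per tile
        "thumbnail": (TILE, TILE),                # whole image, for global context
    }


if __name__ == "__main__":
    plan = plan_tiles(768, 1024)  # a portrait image, width x height
    print(plan["resize_to"], len(plan["boxes"]))
```

Under these assumptions, a 768-by-1024 portrait image matches the 3-by-4 grid exactly, so it would be resized to 1344 by 1792 pixels and split into 12 tiles, with one additional 448-by-448 thumbnail of the whole image passed alongside them.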