1-2-1: Renaissance of Single-Network Paradigm for Virtual Try-On

Shuliang Ning, Yipeng Qin, Xiaoguang Han

TL;DR

This paper proposes a novel single-network VTON method that overcomes the limitations of existing techniques, and suggests that the single-network paradigm can rival the performance of dual-network approaches, offering a more efficient alternative for high-quality, scalable VTON applications.

Abstract

Virtual Try-On (VTON) has become a crucial tool in e-commerce, enabling the realistic simulation of garments on individuals while preserving their original appearance and pose. Early VTON methods relied on single generative networks, but struggled to preserve fine-grained garment details due to limitations in feature extraction and fusion. To address these issues, recent approaches have adopted a dual-network paradigm, incorporating a complementary "ReferenceNet" to enhance garment feature extraction and fusion. While effective, this dual-network approach introduces significant computational overhead, limiting its scalability for high-resolution and long-duration image/video VTON applications. In this paper, we challenge the dual-network paradigm by proposing a novel single-network VTON method that overcomes the limitations of existing techniques. Our method, namely MNVTON, introduces a Modality-specific Normalization strategy that separately processes text, image and video inputs, enabling them to share the same attention layers in a VTON network. Extensive experimental results demonstrate the effectiveness of our approach, showing that it consistently achieves higher-quality, more detailed results for both image and video VTON tasks. Our results suggest that the single-network paradigm can rival the performance of dual-network approaches, offering a more efficient alternative for high-quality, scalable VTON applications.

Paper Structure

This paper contains 15 sections, 6 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: Overview of the proposed MN-VTON. Our method achieves high-quality image and video virtual try-on (VTON) through a Modality-Specific Normalization module. Specifically, for multi-modal inputs, we first apply identical AdaLN-zero normalization for similar modality inputs (e.g., reference garment and image/video) and distinct AdaLN-zero normalization for different modalities (e.g., text and visual inputs). Next, we employ shared-weight self-attention across all tokens to enable effective VTON using only a single network.
  • Figure 2: Visualization of garment feature maps $F^{\rm garment}_l$ using PCA at the output of blocks 1, 6, 11, 16, 21, and 26 of our MN-VTON. $[\cdot, \cdot]$ denotes different combinations of input modalities.
  • Figure 3: Visual comparison on the VITON-HD dataset. Please zoom in for more details.
  • Figure 4: Visual comparison on the VVT dataset.
  • Figure 5: Qualitative comparison on the VIVID dataset.
  • ...and 2 more figures
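
The mechanism described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class names (`AdaLNZero`, `SingleNetworkBlock`), the single attention head, the tensor sizes, the `init_std` parameter, and the use of one conditioning vector `cond` (for example, a timestep embedding) are assumptions made for illustration. The only parts taken from the caption are that text tokens get their own AdaLN-zero normalization, that garment and person/video tokens share one, and that all tokens then pass through the same self-attention weights.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # Parameter-free layer norm over the feature axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


class AdaLNZero:
    """Hypothetical AdaLN-zero: shift, scale and gate regressed from `cond`."""

    def __init__(self, dim, cond_dim, rng, init_std=0.0):
        # init_std=0.0 gives the zero initialization (gate starts at 0);
        # a non-zero value only mimics trained weights for this demo.
        self.W = init_std * rng.standard_normal((cond_dim, 3 * dim))
        self.b = np.zeros(3 * dim)

    def __call__(self, x, cond):
        shift, scale, gate = np.split(cond @ self.W + self.b, 3, axis=-1)
        return layer_norm(x) * (1.0 + scale) + shift, gate


class SingleNetworkBlock:
    """Sketch: modality-specific normalization, then shared self-attention."""

    def __init__(self, dim, cond_dim, rng, init_std=0.0):
        self.norm_text = AdaLNZero(dim, cond_dim, rng, init_std)
        # One normalization shared by garment and person/video tokens.
        self.norm_visual = AdaLNZero(dim, cond_dim, rng, init_std)
        s = 1.0 / np.sqrt(dim)
        self.Wq, self.Wk, self.Wv, self.Wo = (
            rng.normal(0.0, s, (dim, dim)) for _ in range(4)
        )

    def __call__(self, text, garment, target, cond):
        n_t, n_g = len(text), len(garment)
        visual = np.concatenate([garment, target], axis=0)
        h_text, g_text = self.norm_text(text, cond)
        h_vis, g_vis = self.norm_visual(visual, cond)
        # All tokens go through the same attention weights.
        h = np.concatenate([h_text, h_vis], axis=0)
        q, k, v = h @ self.Wq, h @ self.Wk, h @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        out = (attn @ v) @ self.Wo
        # Gated residual: each token uses the gate of its own modality.
        gate = np.concatenate(
            [np.broadcast_to(g_text, h_text.shape),
             np.broadcast_to(g_vis, h_vis.shape)], axis=0)
        y = np.concatenate([text, visual], axis=0) + gate * out
        return y[:n_t], y[n_t:n_t + n_g], y[n_t + n_g:]


rng = np.random.default_rng(0)
text = rng.normal(size=(3, 8))      # 3 text tokens (assumed sizes)
garment = rng.normal(size=(5, 8))   # 5 reference-garment tokens
target = rng.normal(size=(5, 8))    # 5 person image/video tokens
cond = rng.normal(size=4)           # e.g. a timestep embedding (assumption)

block = SingleNetworkBlock(dim=8, cond_dim=4, rng=rng, init_std=0.02)
t_out, g_out, p_out = block(text, garment, target, cond)
print(t_out.shape, g_out.shape, p_out.shape)
```

In this sketch the garment tokens influence the person tokens only through the shared attention, so no second "ReferenceNet" is needed; how closely this matches the paper's block design is not stated in the text above.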