1-2-1: Renaissance of Single-Network Paradigm for Virtual Try-On

Shuliang Ning; Yipeng Qin; Xiaoguang Han

TL;DR

This paper proposes a novel single-network VTON method that overcomes the limitations of existing techniques, and suggests that the single-network paradigm can rival the performance of dual-network approaches, offering a more efficient alternative for high-quality, scalable VTON applications.

Abstract

Virtual Try-On (VTON) has become a crucial tool in e-commerce, enabling the realistic simulation of garments on individuals while preserving their original appearance and pose. Early VTON methods relied on single generative networks, but they struggled to preserve fine-grained garment details because of limitations in feature extraction and fusion. To address these issues, recent approaches have adopted a dual-network paradigm, incorporating a complementary "ReferenceNet" to enhance garment feature extraction and fusion. While effective, this dual-network approach introduces significant computational overhead, limiting its scalability for high-resolution and long-duration image/video VTON applications. In this paper, we challenge the dual-network paradigm by proposing a novel single-network VTON method that overcomes the limitations of existing techniques. Our method, MN-VTON, introduces a Modality-specific Normalization strategy that separately processes text, image and video inputs, enabling them to share the same attention layers in a VTON network. Extensive experimental results demonstrate the effectiveness of our approach, showing that it consistently achieves higher-quality, more detailed results for both image and video VTON tasks. Our results suggest that the single-network paradigm can rival the performance of dual-network approaches, offering a more efficient alternative for high-quality, scalable VTON applications.

Paper Structure (15 sections, 6 equations, 7 figures, 4 tables)

Introduction
Related Work
Method
Preliminaries
Feature Split, Normalization and Fusion
Modality-Specific Normalization
Image-Video Joint Training Strategy
Experiment
Experimental Setup
Generalization across Network Architectures
Comparison with SOTA Methods
Ablation Study
User Study
Application: Parsing-Free Virtual Try-On
Conclusion

Figures (7)

Figure 1: Overview of the proposed MN-VTON. Our method achieves high-quality image and video virtual try-on (VTON) through a Modality-Specific Normalization module. Specifically, for multi-modal inputs, we first apply identical AdaLN-zero normalization for similar modality inputs (e.g., reference garment and image/video) and distinct AdaLN-zero normalization for different modalities (e.g., text and visual inputs). Next, we employ shared-weight self-attention across all tokens to enable effective VTON using only a single network.
Figure 2: Visualization of garment feature maps $F^{\rm garment}_l$ using PCA at the output of blocks 1, 6, 11, 16, 21, 26 of our MN-VTON. $[\cdot, \cdot]$ denotes different combinations of input modalities.
Figure 3: Visual comparison on the VITON-HD dataset. Please zoom in for more details.
Figure 4: Visual comparison on the VVT dataset.
Figure 5: Qualitative comparison on the VIVID dataset.
...and 2 more figures
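
The Modality-Specific Normalization idea in the Figure 1 caption can be illustrated with a small NumPy sketch. It is not the authors' implementation: the conditioning vector `cond`, the single-head attention, the token counts, and every class and function name below are assumptions made for illustration. The sketch only shows the routing described in the caption: text tokens get their own AdaLN-zero style modulation, garment and person tokens share one, and all tokens then pass through a single shared-weight self-attention.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Parameter-free LayerNorm over the channel axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

class AdaLN:
    """AdaLN-zero style modulation (hypothetical helper).

    Regresses shift, scale and gate from a conditioning vector.
    With zero_init=True all three start at zero, so the block is an identity.
    """
    def __init__(self, dim, cond_dim, rng, zero_init=True):
        shape = (cond_dim, 3 * dim)
        self.w = np.zeros(shape) if zero_init else rng.normal(0, 0.02, shape)
        self.b = np.zeros(3 * dim)

    def __call__(self, cond):
        shift, scale, gate = np.split(cond @ self.w + self.b, 3, axis=-1)
        return shift, scale, gate

def self_attention(x, wq, wk, wv):
    """Single-head self-attention over all tokens (weights shared by every modality)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def mn_block(text, garment, person, cond, norm_text, norm_visual, attn_weights):
    """One block: modality-specific normalization, then shared attention."""
    # Garment and person tokens are both visual, so they reuse norm_visual;
    # text tokens use a distinct set of modulation parameters.
    groups = [(text, norm_text), (garment, norm_visual), (person, norm_visual)]
    modulated, gates = [], []
    for tokens, norm in groups:
        shift, scale, gate = norm(cond)
        modulated.append(layer_norm(tokens) * (1 + scale) + shift)
        gates.append(gate)
    # All tokens attend to each other through one set of attention weights.
    attended = self_attention(np.concatenate(modulated, axis=0), *attn_weights)
    split_points = np.cumsum([len(tokens) for tokens, _ in groups])[:-1]
    outputs = np.split(attended, split_points, axis=0)
    # Gated residual connection, applied per modality.
    return [tokens + gate * out
            for (tokens, _), gate, out in zip(groups, gates, outputs)]

# Toy usage with made-up sizes.
rng = np.random.default_rng(0)
dim, cond_dim = 8, 4
text = rng.normal(size=(3, dim))
garment = rng.normal(size=(5, dim))
person = rng.normal(size=(5, dim))
cond = rng.normal(size=cond_dim)
norm_text = AdaLN(dim, cond_dim, rng, zero_init=False)
norm_visual = AdaLN(dim, cond_dim, rng, zero_init=False)
attn_weights = [rng.normal(0, 0.1, (dim, dim)) for _ in range(3)]
out_text, out_garment, out_person = mn_block(
    text, garment, person, cond, norm_text, norm_visual, attn_weights)
```

In this sketch the only per-modality parameters are the two `AdaLN` instances; the attention weights are shared, which is the property that lets a single network handle garment features without a separate ReferenceNet-style branch.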
