TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On

Jiazheng Xing, Chao Xu, Yijie Qian, Yang Liu, Guang Dai, Baigui Sun, Yong Liu, Jingdong Wang

TL;DR

The paper tackles two core issues in diffusion-based virtual try-on: preserving fine-grained garment identity and improving training efficiency. It introduces TryOn-Adapter, which factorizes clothing identity into style, texture, and structure, and injects the corresponding cues into a pre-trained diffusion backbone (frozen except for its attention layers) through three lightweight modules. A training-free T-RePaint strategy further reinforces clothing identity at inference time, and an Enhanced Latent Blending module targets consistent visual quality. The method uses about half the tunable parameters of recent full-tuning diffusion-based methods. Results on VITON-HD and DressCode show state-of-the-art identity preservation and realism, and the paper's ablations and analyses examine the contribution of each component.

Abstract

Virtual try-on focuses on adjusting the given clothes to fit a specific person seamlessly while avoiding any distortion of the patterns and textures of the garment. However, existing diffusion-based methods suffer from uncontrollable clothing identity and inefficient training: they struggle to maintain the identity even with full parameter training, and these limitations hinder widespread application. In this work, we propose an effective and efficient framework, termed TryOn-Adapter. Specifically, we first decouple clothing identity into fine-grained factors: style for color and category information, texture for high-frequency details, and structure for smooth spatial adaptive transformation. Our approach utilizes a pre-trained exemplar-based diffusion model as the fundamental network, whose parameters are frozen except for the attention layers. We then customize three lightweight modules (Style Preserving, Texture Highlighting, and Structure Adapting), combined with fine-tuning techniques, to enable precise and efficient identity control. Meanwhile, we introduce the training-free T-RePaint strategy to further enhance clothing identity preservation while maintaining a realistic try-on effect during inference. Our experiments demonstrate that our approach achieves state-of-the-art performance on two widely used benchmarks. Additionally, compared with recent full-tuning diffusion-based methods, we use only about half of their tunable parameters during training. The code will be made publicly available at https://github.com/jiazheng-xing/TryOn-Adapter.
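
As a rough illustration of what "frozen except for the attention layers" means in practice, the sketch below marks parameters as trainable by name and reports the fraction of weights left tunable. It is a minimal sketch under stated assumptions, not the authors' implementation: the parameter names, the sizes, and the `attn` naming convention are all hypothetical and are not taken from the TryOn-Adapter code.

```python
# Hypothetical parameter inventory: (parameter name, number of weights).
# Names and sizes are invented for illustration only; they do not come
# from the TryOn-Adapter code base.
PARAMS = [
    ("down.0.resnet.conv.weight", 1_000),
    ("down.0.attn.to_q.weight", 400),
    ("down.0.attn.to_k.weight", 400),
    ("mid.resnet.conv.weight", 2_000),
    ("mid.attn.to_v.weight", 400),
    ("up.0.resnet.conv.weight", 1_000),
]


def select_trainable(params, keyword="attn"):
    """Return a {name: trainable} map that is True only for attention weights.

    A parameter counts as an attention weight when one of the dot-separated
    components of its name equals `keyword` (an assumed naming convention).
    """
    return {name: keyword in name.split(".") for name, _ in params}


def trainable_fraction(params, flags):
    """Fraction of all weights that remain trainable under `flags`."""
    total = sum(size for _, size in params)
    trainable = sum(size for name, size in params if flags[name])
    return trainable / total


flags = select_trainable(PARAMS)
fraction = trainable_fraction(PARAMS, flags)
print(f"trainable fraction: {fraction:.3f}")
```

In a real framework the same selection would typically be applied by toggling each parameter's gradient flag; the toy inventory here only shows the bookkeeping and says nothing about the actual parameter counts reported in the paper.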

Paper Structure (14 sections, 10 equations, 19 figures, 9 tables)

Figures (19)

  • Figure 1: Performance comparison of different methods on the VITON-HD dataset at 512 $\times$ 384 resolution: our TryOn-Adapter, the GAN-based method HR-VITON [rombach2022high], and the diffusion-based methods LaDI-VTON [morelli2023ladi] and StableVITON [kim2023stableviton]. Our method generates high-quality results and exhibits strong clothing identity preservation, i.e., consistent color style and logo textures, as well as a smooth transition between long and short sleeves.
  • Figure 2: The overall architecture of our TryOn-Adapter, composed of five parts: 1) the pre-trained Stable Diffusion model, with parameters fixed except for the attention layers; 2) the Style Preserving module, which preserves the overall style of the garment, including color and category information; 3) the Texture Highlighting module, which refines the high-frequency details; 4) the Structure Adapting module, which compensates for unnatural areas caused by clothing changes; and 5) the Enhanced Latent Blending module, which focuses on consistent visual quality.
  • Figure 3: The architecture of the style adapter.
  • Figure 4: (a): Visual illustration for the texture highlighting map generation. (b): Visual illustration for the target segmentation map generation.
  • Figure 5: (a) The architecture of the texture and segmentation adapter. Every ResBlock consists of a convolution layer, two ResNet layers, and two position attention modules. (b) The architecture of the position attention module.
  • ...and 14 more figures