DreamVTON: Customizing 3D Virtual Try-on with Personalized Diffusion Models

Zhenyu Xie, Haoye Dong, Yufei Gao, Zehua Ma, Xiaodan Liang

TL;DR

DreamVTON tackles image-based 3D virtual try-on by decoupling geometry and texture optimization and injecting personalized priors through a multi-concept LoRA and Densepose-guided ControlNet. A template-based optimization scheme mitigates the dominance of personalized diffusion priors across views, while a normal-style LoRA enhances geometry realism. The method combines SDS-based losses with template-derived supervision to produce high-fidelity 3D humans wearing input clothes, preserving identity and clothing style across viewpoints. Experiments show DreamVTON outperforms baselines in geometry, texture, and user-perceived realism, unlocking practical, data-efficient 3D VTON with personalized clothing constraints.

Abstract

Image-based 3D Virtual Try-ON (VTON) aims to sculpt the 3D human according to person and clothes images, which is data-efficient (i.e., getting rid of expensive 3D data) but challenging. Recent text-to-3D methods achieve remarkable improvement in high-fidelity 3D human generation, demonstrating their potential for 3D virtual try-on. Inspired by the impressive success of personalized diffusion models (e.g., DreamBooth and LoRA) for 2D VTON, it is straightforward to achieve 3D VTON by integrating the personalization technique into the diffusion-based text-to-3D framework. However, employing the personalized module in a pre-trained diffusion model (e.g., Stable Diffusion (SD)) would degrade the model's capability for multi-view or multi-domain synthesis, which is detrimental to the geometry and texture optimization guided by the Score Distillation Sampling (SDS) loss. In this work, we propose a novel customized 3D human try-on model, named DreamVTON, which optimizes the geometry and texture of the 3D human separately. Specifically, a personalized SD with multi-concept LoRA is proposed to provide the generative prior about the specific person and clothes, while a Densepose-guided ControlNet is exploited to guarantee a consistent prior on body pose across various camera views. Besides, to prevent the inconsistent multi-view priors of the personalized SD from dominating the optimization, DreamVTON introduces a template-based optimization mechanism, which employs mask templates for learning geometry shape and normal/RGB templates for learning geometry/texture details. Furthermore, for the geometry optimization phase, DreamVTON integrates a normal-style LoRA into the personalized SD to enhance the generative prior for normal maps, facilitating smooth geometry modeling.
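
As a rough illustration of what stacking LoRA modules on a pre-trained model means, the sketch below applies the standard LoRA update, W' = W + scale * (B A), for several adapters to one weight matrix. The abstract does not say how the normal-style LoRA is combined with the multi-concept LoRA, so the additive merge, the scale values, and all names here (`merge_lora`, `concept`, `normal_style`) are assumptions for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
# One adapter: (B, A, scale) with B of shape (d_out, r) and A of shape (r, d_in).
Adapter = Tuple[Matrix, Matrix, float]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(base: Matrix, adapters: Sequence[Adapter]) -> Matrix:
    """Return base + sum_i scale_i * (B_i @ A_i) without modifying base."""
    merged = [row[:] for row in base]
    for b, a, scale in adapters:
        delta = matmul(b, a)  # low-rank update, same shape as base
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += scale * value
    return merged


# Toy 2x2 weight with two rank-1 adapters (hypothetical values).
base = [[1.0, 0.0], [0.0, 1.0]]
concept = ([[1.0], [0.0]], [[0.0, 2.0]], 0.5)        # stands in for a multi-concept LoRA
normal_style = ([[0.0], [1.0]], [[4.0, 0.0]], 0.25)  # stands in for a normal-style LoRA
merged = merge_lora(base, [concept, normal_style])
```

Because each adapter only adds a low-rank term, the base weights stay untouched and an adapter can be dropped by leaving it out of the list. Under this assumption, that is how a normal-style adapter could be enabled for the geometry phase only.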

Paper Structure (22 sections, 8 equations, 12 figures, 2 tables)

Figures (12)

  • Figure 1: (a) Try-on results of the personalized SD. (b) Visual comparison between results of SD with and without LoRA applied directly. Applying LoRA directly reduces the capability of SD for multi-view synthesis.
  • Figure 2: Method Overview. DreamVTON can generate realistic 3D humans given person images, clothes images, and a text prompt. We disentangle the 3D try-on into geometry and appearance learning and design a Multi-concept LoRA and a Normal-style LoRA. Furthermore, we employ a template-based optimization to achieve high-quality geometry and detailed texture.
  • Figure 3: (a) Sample of training data for multi-concept LoRA. (b) Sample of training data for normal-style LoRA. (c) Sample results of multi-concept LoRA and Normal-style LoRA.
  • Figure 4: Qualitative Comparisons. Using the same clothes images, person images, and text prompt as inputs, our method achieves superior results.
  • Figure 5: Ablation study for geometry optimization.
  • ...and 7 more figures