VTON 360: High-Fidelity Virtual Try-On from Any Viewing Direction

Zijian He, Yuwei Ning, Yipeng Qin, Guangrun Wang, Sibei Yang, Liang Lin, Guanbin Li

TL;DR

VTON 360 addresses the challenge of high-fidelity 3D virtual try-on from any viewing direction by reframing 3D VTON as an extension of 2D VTON that takes multi-view inputs and produces 3D-consistent outputs. It introduces a pseudo-3D pose representation based on SMPL-X normal maps, a multi-view spatial attention mechanism, and a multi-view CLIP conditioning scheme to enforce coherence across views, all trained within a latent diffusion framework; the edited views are then reconstructed into 3D with Gaussian Splatting. On THuman2.0, MVHumanNet, and e-commerce garment images, it achieves better texture preservation and multi-view consistency than state-of-the-art baselines, as confirmed by quantitative metrics and user studies. The approach is practically relevant for immersive online fashion visualization, enabling 360° VTON with realistic garment details and consistent appearance across views.

Abstract

Virtual Try-On (VTON) is a transformative technology in e-commerce and fashion design, enabling realistic digital visualization of clothing on individuals. In this work, we propose VTON 360, a novel 3D VTON method that addresses the open challenge of achieving high-fidelity VTON that supports any-view rendering. Specifically, we leverage the equivalence between a 3D model and its rendered multi-view 2D images, and reformulate 3D VTON as an extension of 2D VTON that ensures 3D consistent results across multiple views. To achieve this, we extend 2D VTON models to include multi-view garments and clothing-agnostic human body images as input, and propose several novel techniques to enhance them, including: i) a pseudo-3D pose representation using normal maps derived from the SMPL-X 3D human model, ii) a multi-view spatial attention mechanism that models the correlations between features from different viewing angles, and iii) a multi-view CLIP embedding that enhances the garment CLIP features used in 2D VTON with camera information. Extensive experiments on large-scale real datasets and clothing images from e-commerce platforms demonstrate the effectiveness of our approach. Project page: https://scnuhealthy.github.io/VTON360.

Paper Structure

This paper contains 17 sections, 11 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Results of VTON 360. Our VTON 360 enables high-fidelity 3D Virtual Try-On (VTON) by seamlessly adapting e-commerce garments onto a clothed 3D human model, supporting full 360$^\circ$ view rendering. The highlighted bounding boxes (dashed lines) demonstrate our method's ability to preserve intricate clothing details and patterns (e.g., collar accessories, horizontal line patterns, logos, text, numbers) across diverse garment types.
  • Figure 2: Overview of VTON 360. Given an input 3D human model $\mathbf{G_{\rm src}}$ and a pair of garment images ($g_f$, $g_b$), our method 1) renders $\mathbf{G_{\rm src}}$ into multi-view 2D images (left), 2) edits the rendered multi-view 2D images (middle), and 3) reconstructs the edited images into a 3D model $\mathbf{G_{\rm VTON}}$ (right). In the crucial step 2), we propose three novel techniques to equip a typical 2D VTON network with the capability to generate 3D-consistent results: 1) Pseudo-3D Pose Input, 2) Multi-view Spatial Attention, and 3) Multi-view CLIP Embedding.
  • Figure 3: DensePose (2D) vs. SMPL-X normal map (pseudo-3D) representations. DensePose applies uniform labels per body part, lacking 3D consistency across views and causing artifacts and temporal inconsistencies (highlighted with red boxes). In contrast, SMPL-X normal maps capture fine surface details, ensuring geometric coherence and stable, realistic shading across views.
  • Figure 4: Illustration of the proposed Multi-view Spatial Attention. Query (Q): multi-view features $\mathbf{F^l}$; Key (K) and Value (V): concatenation of $\mathbf{F^l}$ and garment features $F^l_f$ and $F^l_b$. The attention score between viewpoints $i$ and $j$ is modulated by a weight $C_{ij}$, determined by the cosine of the angle between them.
  • Figure 5: Qualitative comparison. The first two rows show results on the THuman2.0 dataset, while the last two rows show results on the MVHumanNet dataset. Our method achieves good texture preservation (highlighted by the blue boxes), while the three baseline methods mostly fail.
  • ...and 4 more figures
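
The Figure 4 caption describes attention in which each view's features act as queries, the keys and values are the concatenation of all view features with the front and back garment features, and the score between views $i$ and $j$ is modulated by a weight $C_{ij}$ determined by the cosine of the angle between them. The following is a minimal NumPy sketch of that idea, not the authors' implementation. Several details are assumptions made only for illustration: the mapping $C_{ij} = (1 + \cos(\theta_i - \theta_j))/2$, applying $C_{ij}$ to the softmax probabilities followed by renormalization, giving garment tokens a weight of 1, describing each view by a single azimuth angle, and omitting the learned query/key/value projections. The names `view_weights` and `multiview_attention` are hypothetical.

```python
import numpy as np


def view_weights(azimuths_deg):
    """Pairwise view weights C[i, j] from the cosine of the angle between views.

    ASSUMPTION: the cosine is mapped to [0, 1] via (1 + cos) / 2; the caption
    only says the weight is "determined by the cosine of the angle".
    """
    theta = np.deg2rad(np.asarray(azimuths_deg, dtype=np.float64))
    cos = np.cos(theta[:, None] - theta[None, :])  # (V, V) via broadcasting
    return 0.5 * (1.0 + cos)


def multiview_attention(F, G_f, G_b, azimuths_deg):
    """Sketch of view-weighted attention over multi-view and garment tokens.

    F:   (V, N, D) features for V views with N tokens each (queries).
    G_f: (M, D) front-garment features; G_b: (M, D) back-garment features.
    Keys/values are all view tokens concatenated with both garment token sets.
    ASSUMPTION: learned Q/K/V projections are omitted for brevity.
    """
    V, N, D = F.shape
    C = view_weights(azimuths_deg)                         # (V, V)
    kv = np.concatenate([F.reshape(V * N, D), G_f, G_b])   # (V*N + 2M, D)
    n_garment = G_f.shape[0] + G_b.shape[0]

    out = np.empty_like(F)
    for i in range(V):
        logits = F[i] @ kv.T / np.sqrt(D)                  # (N, V*N + 2M)
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)

        # Per-key weight: C[i, j] for every token of view j, 1 for garment
        # tokens (ASSUMPTION), then renormalize so each row sums to 1.
        w = np.concatenate([np.repeat(C[i], N), np.ones(n_garment)])
        probs = probs * w
        probs /= probs.sum(axis=1, keepdims=True)

        out[i] = probs @ kv
    return out
```

Under the assumed mapping, a view attends most strongly to itself and to nearby views, while a view on the opposite side of the body (180° away) contributes nothing; the garment tokens stay available to every view. A call such as `multiview_attention(F, G_f, G_b, [0, 90, 180, 270])` returns an array with the same shape as `F`.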