Mobile-VTON: High-Fidelity On-Device Virtual Try-On

Zhenchen Wan, Ce Chen, Runqi Lin, Jiaxin Huang, Tianxi Chen, Yanwu Xu, Tongliang Liu, Mingming Gong

TL;DR

Mobile-VTON proposes a Feature-Guided Adversarial Distillation strategy that combines teacher supervision with adversarial learning to better match real-world image distributions. Experiments demonstrate that high-quality VTON is not only feasible but also practical on-device, offering a secure solution for real-world applications.

Abstract

Virtual try-on (VTON) has recently achieved impressive visual fidelity, but most existing systems require uploading personal photos to cloud-based GPUs, raising privacy concerns and limiting on-device deployment. To address this, we present Mobile-VTON, a high-quality, privacy-preserving framework that enables fully offline virtual try-on on commodity mobile devices using only a single user image and a garment image. Mobile-VTON introduces a modular TeacherNet-GarmentNet-TryonNet (TGT) architecture that integrates knowledge distillation, garment-conditioned generation, and garment alignment into a unified pipeline optimized for on-device efficiency. Within this framework, we propose a Feature-Guided Adversarial (FGA) Distillation strategy that combines teacher supervision with adversarial learning to better match real-world image distributions. GarmentNet is trained with a trajectory-consistency loss to preserve garment semantics across diffusion steps, while TryonNet uses latent concatenation and lightweight cross-modal conditioning to enable robust garment-to-person alignment without large-scale pretraining. By combining these components, Mobile-VTON achieves high-fidelity generation with low computational overhead. Experiments on VITON-HD and DressCode at 1024 x 768 show that it matches or outperforms strong server-based baselines while running entirely offline. These results demonstrate that high-quality VTON is not only feasible but also practical on-device, offering a secure solution for real-world applications.
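
As a rough illustration of how teacher supervision and adversarial learning can be combined in one student objective, the sketch below adds a feature-matching term to a generator-side adversarial term. It is not the authors' implementation: the mean-squared-error form of the feature loss, the non-saturating form of the adversarial loss, the weight `adv_weight`, and all function names are assumptions made for this example.

```python
import math


def feature_loss(student_feats, teacher_feats):
    """Mean squared error between student and teacher features (assumed form)."""
    if len(student_feats) != len(teacher_feats):
        raise ValueError("feature vectors must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_feats, teacher_feats)]
    return sum(squared) / len(squared)


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def generator_adversarial_loss(disc_logits_on_fake):
    """Non-saturating generator loss: mean of softplus(-logit) (assumed form).

    The logits are the discriminator's raw scores on student outputs;
    higher logits mean "looks real", so the loss falls as logits rise.
    """
    terms = [_softplus(-z) for z in disc_logits_on_fake]
    return sum(terms) / len(terms)


def fga_student_loss(student_feats, teacher_feats, disc_logits_on_fake,
                     adv_weight=0.1):
    """Feature-level distillation plus a weighted adversarial term.

    adv_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    distill = feature_loss(student_feats, teacher_feats)
    adversarial = generator_adversarial_loss(disc_logits_on_fake)
    return distill + adv_weight * adversarial
```

In this sketch the feature term pulls the student toward the teacher's intermediate representations, while the adversarial term rewards outputs the discriminator scores as real. That division of labor is one plausible reading of "combines teacher supervision with adversarial learning".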

Paper Structure (14 sections, 9 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 1: Comparison of virtual try-on methods in terms of model size, mobile compatibility, and visual quality. Our model, with only 415M parameters, achieves competitive visual results while running entirely on mobile devices. The leftmost column shows the input person and garment images. All outputs are generated or super-resolved at $1024 \times 768$ resolution. Zoom in for details.
  • Figure 2: Training overview of Mobile-VTON. The left side illustrates our Feature-Guided Adversarial (FGA) Distillation process, where a high-capacity TeacherNet supervises two lightweight student networks: GarmentNet and TryonNet. GarmentNet learns to reconstruct the garment using feature-level distillation loss $\mathcal{L}_{\text{feature}}^{\text{G}}$, while TryonNet is guided by both feature-level loss $\mathcal{L}_{\text{feature}}^{\text{T}}$ and adversarial loss $\mathcal{L}_{\text{GAN}}$. The right side shows the main training pipeline of Mobile-VTON, where TryonNet and GarmentNet are jointly trained with garment-aware supervision. The Light-Adapter injects garment semantics, and the model is optimized using latent concatenation, temporal consistency loss $\mathcal{L}_{\text{cons}}$, and reconstruction loss $\mathcal{L}_{\text{Diff}}$ to generate high-fidelity try-on images.
  • Figure 3: Qualitative comparison on VITON-HD and DressCode between our approach and server-side baselines. Zoom in for finer details.
  • Figure 4: Comparison of virtual try-on results with and without the TCG and LC. Zoom in for more details.
  • Figure 5: Qualitative ablation results on the DressCode upper-body test set. Removing key components degrades garment detail, structural consistency, or overall synthesis quality.
  • ...and 4 more figures