RefVTON: person-to-person Try on with Additional Unpaired Visual Reference

Liuzhuozheng Li, Yue Gong, Shanyuan Liu, Bo Cheng, Yuhang Ma, Liebucha Wu, Dengyang Jiang, Zanyi Wang, Dawei Leng, Yuhui Yin

TL;DR

RefTON targets realistic virtual try-on without auxiliary inputs such as pose estimates and masks, using unpaired reference images to guide garment texture and structure. Built on Flux-Kontext with a modified position index and a two-stage training regime, it fits the target garment directly onto a source person. The new VFR (Virtual Fitting with Reference) dataset and a reference-data generation pipeline enable reference-guided refinement. The method supports both mask-based and mask-free inference, achieves state-of-the-art results on DressCode and VITON-HD, and generalizes well to in-the-wild images. For online fashion, this means simpler deployment, higher fidelity on intricate garment details, and flexible integration of additional visual references.

Abstract

We introduce RefTON, a FLUX-based person-to-person virtual try-on framework that enhances garment realism through unpaired visual references. Conventional approaches rely on complex auxiliary inputs such as body parsing and warped masks, or require carefully designed extraction branches to process the various input conditions. RefTON instead generates try-on results directly from a source image and a target garment, without structural guidance or auxiliary components for handling diverse inputs. Moreover, inspired by how people choose clothing, RefTON leverages additional reference images (the target garment worn by different individuals) to guide texture alignment and preserve garment details. To enable this capability, we built a dataset containing unpaired reference images for training. Extensive experiments on public benchmarks demonstrate that RefTON achieves competitive or superior performance compared to state-of-the-art methods, while maintaining a simple and efficient person-to-person design.

Paper Structure

This paper contains 21 sections, 3 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: In-the-wild try-on results generated by our RefVTON model in a person-to-person (p2p) style, trained on person and garment images from our Virtual Fitting with Reference (VFR) dataset. The first row demonstrates our mask-free try-on capability, where the garment is transferred directly to the target person without masks or pose estimation. The second row shows our additional-reference try-on mode, in which extra visual references are incorporated to enhance structural accuracy, texture fidelity, and overall realism.
  • Figure 2: The effect of using reference images for the virtual try-on task. From left to right in the three middle subfigures are: (i) results generated without using reference images during either training or inference; (ii) results generated by a model trained and inferred with reference images. Incorporating reference images consistently improves the try-on quality and authenticity in both training and inference stages. Please zoom in for more details.
  • Figure 3: The pipeline of our two-stage training strategy: (a) In the first stage, which follows a paradigm similar to mask-based try-on approaches, the model is trained on masked person images to generate images of people wearing random garments, which serve as training data for the next stage. (b) In the second stage, the synthesized person images produced in the first stage, together with the target garment and (optionally) additional reference images, are used as inputs to train a person-to-person virtual try-on model that directly fits the target garment onto the person's body.
  • Figure 4: Adaptation of a three-channel position index: the first channel encodes different conditional inputs, while the second and third channels provide spatial positional information for adapting the resolution of the target inputs.
  • Figure 5: The overall pipeline for generating the reference images. We first generate the appearance descriptions using Qwen2.5-VL (Bai et al., 2025), and then concatenate the appearance with the corresponding actions and outfits to construct the positive and negative prompts, as shown in (a). Subsequently, the images and the textual prompts are fed into the editing model, which generates photos of individuals wearing the same clothes. These results are the reference images for each image–cloth pair, as shown in (b).
  • ...and 10 more figures
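
The three-channel position index described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `COND_INDEX` mapping, the specific index values, and the ordering of inputs are all hypothetical, chosen only to show the idea that the first channel tags which conditional input a token belongs to while the second and third channels carry its row and column on that input's own grid.

```python
import numpy as np

# Hypothetical channel-0 values; the paper's actual assignment is not given here.
COND_INDEX = {"target": 0, "person": 1, "garment": 2, "reference": 3}


def make_position_ids(height, width, cond_index):
    """Return a (height * width, 3) array of position ids for one token grid.

    Channel 0: which input the token comes from (assumed integer tag).
    Channel 1: row index on that input's grid.
    Channel 2: column index on that input's grid.
    """
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    ids = np.zeros((height, width, 3), dtype=np.int64)
    ids[..., 0] = cond_index
    ids[..., 1] = rows
    ids[..., 2] = cols
    return ids.reshape(-1, 3)


def build_sequence_ids(grids):
    """Concatenate position ids for several inputs.

    grids maps an input name to its (height, width) in tokens; inputs may
    have different resolutions because each grid is indexed independently.
    """
    parts = [make_position_ids(h, w, COND_INDEX[name]) for name, (h, w) in grids.items()]
    return np.concatenate(parts, axis=0)


# Example: a 4x3 target, a 4x3 person image, and a smaller 2x2 garment image.
ids = build_sequence_ids({"target": (4, 3), "person": (4, 3), "garment": (2, 2)})
print(ids.shape)  # (28, 3)
```

Because each input is indexed on its own grid, an optional reference image could simply be appended as one more entry in `grids` under this scheme; whether the paper rescales the spatial channels across resolutions is not stated in the caption and is left out of the sketch.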