CORAL: Correspondence Alignment for Improved Virtual Try-On

Jiyoung Kim, Youngjin Shin, Siyoon Jin, Dahyun Chung, Jisu Nam, Tongmin Kim, Jongjae Park, Hyeonwoo Kang, Seungryong Kim

TL;DR

This work tackles the challenge of preserving fine garment details in virtual try-on (VTON) under unpaired and cross-category conditions by exposing and refining the internal person–garment correspondence within the full 3D attention of a Diffusion Transformer (DiT). It introduces CORAL, which aligns query–key matches with robust external correspondences via a correspondence distillation loss $\mathcal{L}_{\text{corr}}$ and an entropy minimization loss $\mathcal{L}_{\text{ent}}$, integrated into a two-panel diptych DiT architecture. The approach yields state-of-the-art results on standard VTON benchmarks, a VLM-based evaluation protocol, and in-the-wild datasets, with ablations confirming that the two losses are complementary. By sharpening spatial attention and grounding it in reliable correspondences, CORAL improves both global garment shape transfer and local detail fidelity, which makes it practical for real-world VTON applications and suggests that correspondence supervision could extend to broader customization tasks. The training objective is $\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{velocity}} + \lambda_{\text{corr}}\mathcal{L}_{\text{corr}} + \lambda_{\text{ent}}\mathcal{L}_{\text{ent}}$, and both auxiliary losses operate on the person-to-garment attention $A^{t,l}_{\mathcal{P}\rightarrow\mathcal{G}}$, which is the central quantity for guiding and evaluating alignment within the DiT.
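To make the person-to-garment attention $A_{\mathcal{P}\rightarrow\mathcal{G}}$ concrete, the following is a minimal NumPy sketch of how such a matrix could be read out of one attention head. It is an illustration under stated assumptions, not the paper's implementation: the function name `person_to_garment_attention`, the index arrays, the toy sizes, and the choice to renormalize the softmax over garment keys only are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def person_to_garment_attention(q, k, person_idx, garment_idx):
    """Slice person queries and garment keys out of one attention head.

    q, k: (N, d) arrays holding queries and keys for all N tokens.
    person_idx, garment_idx: integer arrays selecting the two token groups.
    Returns a (num_person, num_garment) matrix whose rows sum to 1.
    Assumption: the softmax is taken over garment keys only.
    """
    d = q.shape[-1]
    logits = q[person_idx] @ k[garment_idx].T / np.sqrt(d)
    return softmax(logits, axis=-1)

# Toy example with random queries and keys (hypothetical sizes).
rng = np.random.default_rng(0)
n_garment, n_person, d = 6, 6, 8
q = rng.normal(size=(n_garment + n_person, d))
k = rng.normal(size=(n_garment + n_person, d))
garment_idx = np.arange(0, n_garment)
person_idx = np.arange(n_garment, n_garment + n_person)

A = person_to_garment_attention(q, k, person_idx, garment_idx)
matches = A.argmax(axis=-1)  # best-matching garment token per person token
```

Each row of `A` is a distribution over garment tokens for one person token, so `argmax` over a row gives the garment location that the person token attends to most strongly.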

Abstract

Existing methods for Virtual Try-On (VTON) often struggle to preserve fine garment details, especially in unpaired settings where accurate person-garment correspondence is required. These methods do not explicitly enforce person-garment alignment and fail to explain how correspondence emerges within Diffusion Transformers (DiTs). In this paper, we first analyze full 3D attention in DiT-based architecture and reveal that the person-garment correspondence critically depends on precise person-garment query-key matching within the full 3D attention. Building on this insight, we then introduce CORrespondence ALignment (CORAL), a DiT-based framework that explicitly aligns query-key matching with robust external correspondences. CORAL integrates two complementary components: a correspondence distillation loss that aligns reliable matches with person-garment attention, and an entropy minimization loss that sharpens the attention distribution. We further propose a VLM-based evaluation protocol to better reflect human preference. CORAL consistently improves over the baseline, enhancing both global shape transfer and local detail preservation. Extensive ablations validate our design choices.
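The two components described above can be sketched on a row-normalized attention matrix as follows. This is a hedged illustration rather than the authors' formulation: the source names the losses but not their exact form, so the cross-entropy used for `correspondence_loss`, the `reliable` mask, and the default weights `lam_corr` and `lam_ent` are assumptions chosen for clarity.

```python
import numpy as np

def entropy_loss(A, eps=1e-8):
    # Mean Shannon entropy of the attention rows; lower means sharper attention.
    return float(-(A * np.log(A + eps)).sum(axis=-1).mean())

def correspondence_loss(A, target_idx, reliable, eps=1e-8):
    # Assumed form: cross-entropy between each person row of A and its
    # pseudo ground-truth garment index, averaged over reliable queries only.
    rows = np.flatnonzero(reliable)
    if rows.size == 0:
        return 0.0
    picked = A[rows, target_idx[rows]]
    return float(-np.log(picked + eps).mean())

def total_loss(l_velocity, A, target_idx, reliable, lam_corr=1.0, lam_ent=1.0):
    # L_total = L_velocity + lam_corr * L_corr + lam_ent * L_ent
    # The weights are placeholder values, not taken from the paper.
    return (l_velocity
            + lam_corr * correspondence_loss(A, target_idx, reliable)
            + lam_ent * entropy_loss(A))

# Compare a uniform attention map with a sharp, correctly aligned one.
target_idx = np.arange(4)
reliable = np.ones(4, dtype=bool)
A_uniform = np.full((4, 4), 0.25)
A_sharp = 0.96 * np.eye(4) + 0.01  # rows sum to 1, mass on the diagonal

loss_uniform = total_loss(0.0, A_uniform, target_idx, reliable)
loss_sharp = total_loss(0.0, A_sharp, target_idx, reliable)
```

In this toy case the sharp, aligned map scores lower on both terms, which matches the stated intent: one term pulls attention toward the external matches and the other concentrates it.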

Paper Structure (33 sections, 27 equations, 24 figures, 10 tables)


Figures (24)

  • Figure 1: Teaser. (a) CORAL is a Diffusion Transformer-based framework that explicitly enhances person$\rightarrow$garment correspondence within the 3D attention of DiT. This yields more accurate local details in the generated results, with fewer artifacts such as the duplicated garment hems visible in the baseline. (b) Virtual Try-On results on VITON-HD and DressCode, comparing outputs without and with CORAL. (c) Person-to-Person garment transfer results on challenging in-the-wild images from PPR10K.
  • Figure 3: Overall Architecture. CORAL builds upon a baseline architecture that constructs the noisy latent $\mathbf{z}_t$ by horizontally concatenating the noisy garment latents $\mathbf{z}_{\text{g},t}$ and person latents $\mathbf{z}_{\text{p},t}$, and then channel-wise concatenates the conditioning canvas $\mathbf{z}_{\text{diptych}}$ and mask canvas $\mathbf{m}_{\text{diptych}}$ with $\mathbf{z}_t$ before the input projection layer. Pose is injected by adding $\mathbf{z}_{\text{pose}}$ as tokens, with RoPE set to share spatial positions between person and pose tokens. $\mathcal{L}_\text{CORAL}$ is applied to the person$\rightarrow$garment matching cost $A^{t,l}_{\mathcal{P}\rightarrow\mathcal{G}}$ estimated from MM-Attention within DiT blocks: $\mathcal{L}_{\text{corr}}$ aligns $A^{t,l}_{\mathcal{P}\rightarrow\mathcal{G}}$ to pseudo ground-truth correspondences extracted from DINOv3, while $\mathcal{L}_{\text{ent}}$ is computed on $A^{t,l}_{\mathcal{P}\rightarrow\mathcal{G}}$ to encourage sharper, more localized matches.
  • Figure 4: Correspondence Visualization and Warped Results. We visualize correspondence fields and the resulting warped garment, computed by mapping pixels from the garment reference image to the garment region of the person image. (a) Baseline attention-derived correspondence. (b) DINOv3 correspondence before reliability filtering. (c) Refined correspondence after cycle-consistency check. The baseline warp shows geometric distortion, while unfiltered DINOv3 can mistakenly match visually similar regions, such as lower-garment areas.
  • Figure 5: Qualitative Comparison. We show qualitative results on the standard benchmarks VITON-HD (Choi et al., 2021) and DressCode (Morelli et al., 2022), as well as on an in-the-wild evaluation dataset built from PPR10K (Liang et al., 2021). Best viewed zoomed in. Additional qualitative results are provided in the appendix.
  • Figure 6: Ablation of Loss Components. We demonstrate the effectiveness of the two losses, $\mathcal{L}_\text{ent}$ and $\mathcal{L}_\text{corr}$. Orange and green markers denote query points, and attention maps outlined in the same colors indicate the matches for each variant. By combining $\mathcal{L}_\text{ent}$ with $\mathcal{L}_\text{corr}$, our model localizes the correct keys most accurately and exhibits the sharpest attention, yielding the best VTON performance.
  • ...and 19 more figures
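
The cycle-consistency check mentioned in the Figure 4 caption can be illustrated with a common forward-backward formulation. The sketch below is an assumption about what such a filter looks like, not the paper's exact criterion: the function name `cycle_consistent_matches`, the `max_dist` threshold, and the synthetic features standing in for DINOv3 descriptors are all hypothetical.

```python
import numpy as np

def cycle_consistent_matches(feat_p, feat_g, coords_p, max_dist=1.0):
    """Forward-backward filtering of nearest-neighbour matches.

    feat_p: (Np, d) person features, feat_g: (Ng, d) garment features.
    coords_p: (Np, 2) spatial positions of the person tokens.
    Returns (fwd, reliable): the garment index matched to each person token,
    and a boolean mask that is True where person -> garment -> person lands
    within max_dist of the starting position.
    """
    sim = feat_p @ feat_g.T          # similarity between all token pairs
    fwd = sim.argmax(axis=1)         # person -> garment
    bwd = sim.argmax(axis=0)         # garment -> person
    back = bwd[fwd]                  # round trip back to a person token
    dist = np.linalg.norm(coords_p[back] - coords_p, axis=1)
    return fwd, dist <= max_dist

# Synthetic stand-in features: four garment tokens with orthogonal features,
# four person tokens that are a permutation of them, plus one distant
# person token that duplicates the first person token's features.
feat_g = np.eye(4)
perm = np.array([2, 0, 3, 1])
feat_p = np.vstack([feat_g[perm], feat_g[perm[0]]])
coords_p = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5, 5]], dtype=float)

fwd, reliable = cycle_consistent_matches(feat_p, feat_g, coords_p)
```

The duplicate token matches the same garment token as its look-alike, but the backward match returns to the original position, so the round trip fails and the match is dropped. This is the kind of ambiguity between visually similar regions that filtering is meant to remove.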