BridgeDiff: Bridging Human Observations and Flat-Garment Synthesis for Virtual Try-Off

Shuang Liu; Ao Yu; Linkang Cheng; Xiwen Huang; Li Zhao; Junhui Liu; Zhiting Lin; Yu Liu

TL;DR

BridgeDiff is a diffusion-based framework that explicitly bridges human-centric observations and flat-garment synthesis through two complementary components. It achieves state-of-the-art performance on standard virtual try-off benchmarks, producing higher-quality flat-garment reconstructions while preserving fine-grained appearance and structural integrity.

Abstract

Virtual try-off (VTOFF) aims to recover canonical flat-garment representations from images of dressed persons for standardized display and downstream virtual try-on. Prior methods often treat VTOFF as direct image translation driven by local masks or text-only prompts, overlooking the gap between on-body appearances and flat layouts. This gap frequently leads to inconsistent completion in unobserved regions and unstable garment structure. We propose BridgeDiff, a diffusion-based framework that explicitly bridges human-centric observations and flat-garment synthesis through two complementary components. First, the Garment Condition Bridge Module (GCBM) builds a garment-cue representation that captures global appearance and semantic identity, enabling robust inference of continuous details under partial visibility. Second, the Flat Structure Constraint Module (FSCM) injects explicit flat-garment structural priors via Flat-Constraint Attention (FC-Attention) at selected denoising stages, improving structural stability beyond text-only conditioning. Extensive experiments on standard VTOFF benchmarks show that BridgeDiff achieves state-of-the-art performance, producing higher-quality flat-garment reconstructions while preserving fine-grained appearance and structural integrity.

Paper Structure (28 sections, 15 equations, 16 figures, 7 tables)

This paper contains 28 sections, 15 equations, 16 figures, 7 tables.

Introduction
Related Work
Methodology
Garment Condition Bridge Module
Flat Structure Constraint for Conditional Diffusion
Experiment
Datasets and Metrics
Implementation details
Main Results
Ablation Studies and Analysis
User Study
Conclusion and Future Work
Preliminaries and Notations
Preliminaries
Latent Diffusion Models.
...and 13 more sections

Figures (16)

Figure 1: Existing methods often suffer from visual discontinuity and structural instability when relying solely on textual conditioning or weak local constraints, especially under occlusions or partial observations. In contrast, BridgeDiff bridges dressed-person observations and canonical flat-garment representations via a garment-cue representation and explicit flat-structure guidance.
Figure 2: Overview of the proposed GCBM. Rather than directly mapping dressed-person observations to flat-garment images, GCBM aggregates multiple sources of visual information into a garment-cue representation, capturing the global appearance and identity of the target garment to support visually continuous flat-garment synthesis.
Figure 3: Overview of the proposed Flat Structure Constraint for Conditional Diffusion architecture. The framework consists of a trainable model UNet and a largely frozen denoising UNet. To explicitly enforce flat garment layouts, a flat structure constraint module (FSCM) is integrated into the denoising UNet, ensuring stable layout generation without compromising appearance fidelity.
Figure 4: Qualitative comparisons on the DressCode dataset. Red circles highlight differences in local regions across different methods. Unmarked examples indicate cases where the overall garment structure or color appearance differs from the reference. Zooming in provides a clearer view of these differences.
Figure 5: Qualitative comparisons on the VITON-HD dataset. Red circles highlight differences in local regions across different methods. Unmarked examples indicate cases where the overall garment structure or color appearance differs from the reference. Zooming in provides a clearer view of these differences.
...and 11 more figures
