Spatio-Temporal Garment Reconstruction Using Diffusion Mapping via Pattern Coordinates

Yingxuan You, Ren Li, Corentin Dumery, Cong Cao, Hao Li, Pascal Fua

TL;DR

This work introduces a mapping model that establishes correspondences between image pixels, UV pattern coordinates, and 3D geometry, enabling accurate and detailed garment reconstruction from single images. It also develops analytic projection-based constraints that preserve image-aligned geometry in visible regions while enforcing coherent completion in occluded areas over time.

Abstract

Reconstructing 3D clothed humans from monocular images and videos is a fundamental problem with applications in virtual try-on, avatar creation, and mixed reality. Despite significant progress in human body recovery, accurately reconstructing garment geometry, particularly for loose-fitting clothing, remains an open challenge. We propose a unified framework for high-fidelity 3D garment reconstruction from both single images and video sequences. Our approach combines Implicit Sewing Patterns (ISP) with a generative diffusion model to learn expressive garment shape priors in 2D UV space. Leveraging these priors, we introduce a mapping model that establishes correspondences between image pixels, UV pattern coordinates, and 3D geometry, enabling accurate and detailed garment reconstruction from single images. We further extend this formulation to dynamic reconstruction by introducing a spatio-temporal diffusion scheme with test-time guidance to enforce long-range temporal consistency. We also develop analytic projection-based constraints that preserve image-aligned geometry in visible regions while enforcing coherent completion in occluded areas over time. Although trained exclusively on synthetically simulated cloth data, our method generalizes well to real-world imagery and consistently outperforms existing approaches on both tight- and loose-fitting garments. The reconstructed garments preserve fine geometric detail while exhibiting realistic dynamic motion, supporting downstream applications such as texture editing, garment retargeting, and animation.

Paper Structure (59 sections, 37 equations, 22 figures, 5 tables)

Figures (22)

  • Figure 1: Given a single image (top) or a monocular video (bottom) of a clothed person, our proposed method can reconstruct high-fidelity 3D garment models with realistic details and temporal consistency.
  • Figure 2: Pipeline. Given an image of a clothed person, we first estimate the front normal $\mathbf{n}_F$ of the target garment and the SMPL body model, which is used to render the body part segmentation ($\mathbf{s}_F$, $\mathbf{s}_B$) and depth ($\mathbf{d}_F^b$, $\mathbf{d}_B^b$) images. The back normal $\mathbf{n}_B$ of the garment is then estimated by the diffusion model $\boldsymbol{\epsilon}_{\theta}^n$. Next, the mapping model $\boldsymbol{\epsilon}_{\theta}^m$ predicts the UV-coordinate ($\mathbf{c}_F$, $\mathbf{c}_B$) and depth ($\mathbf{d}_F^g$, $\mathbf{d}_B^g$) images from the garment normals and body estimates. The incomplete UV positional map $\tilde{\mathcal{U}}$ is produced from them using camera backprojection. Finally, we fit $\tilde{\mathcal{U}}$ to DISP to recover the complete UV positional map $\hat{\mathcal{U}}$ and the corresponding garment mesh $\mathbf{g}$, which is further improved by a refinement step.
  • Figure 3: Mapping between pixel, 3D, and UV spaces. The pixel $(x,y)$ is mapped to $(X,Y,Z)$ in the 3D space using the estimated depth $d$ and the camera backprojection $P^{-1}$, and to $(u,v)$ in the UV space using the estimated UV coordinates $(u,v,\sigma)$. The dashed line indicates that $(X,Y,Z)$ and $(u,v)$ are connected indirectly through $(x,y)$.
  • Figure 4: Recovering garment rest geometry. Given (a) the incomplete panel mask $\tilde{\mathcal{M}}$, we fit (b) the complete panel mask $\mathcal{M}$ by Eq. \ref{eq:z}. (c) shows the overlay of $\tilde{\mathcal{M}}$ in gray and $\mathcal{M}$ in white. (d) is the corresponding rest-state garment mesh $\bar{\mathbf{g}}$ for (b).
  • Figure 5: Processing a video sequence. Given a set of images with extracted body segmentations $\mathbf{S}$, body depths $\mathbf{D}^b$, and garment normals $\mathbf{N}^F$, our method produces a sequence of garment meshes $\mathbf{G}$ in three steps. First, the back-view normals $\mathbf{N}^B$ of the garment are inferred. By design, our method ensures these predictions are temporally consistent. Second, a mapping network estimates the 2D/3D positions of each pixel, where the 2D positions $(\mathbf{C}_F,\mathbf{C}_B)$ are in a reference pattern space, and the 3D positions are represented as depth maps $(\mathbf{D}_F^g,\mathbf{D}_B^g)$. We introduce novel guidance on this generation to match the normal estimates and prevent intersection with the body. Finally, this mapping is unwrapped into a partial 2D pattern $\tilde{\mathbf{U}}$ in which each pixel value encodes a 3D position, and our temporal inpainting diffusion completes these partial observations into a full garment sequence while ensuring the partial constraints are respected.
  • ...and 17 more figures
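
The pixel-to-3D-to-UV mapping described in the captions of Figures 2 and 3 can be illustrated with a small sketch. It is not the authors' implementation: it assumes a simple pinhole camera with hypothetical intrinsics `fx, fy, cx, cy` (the paper's actual camera model is not given here), treats $(u,v)$ as normalized to $[0,1]$, and leaves out the third UV component $\sigma$. The function names `backproject` and `partial_uv_map` are made up for illustration. Each pixel with an estimated depth is lifted to $(X,Y,Z)$ and written into the UV texel addressed by its estimated $(u,v)$; texels that no visible pixel maps to stay empty, which is why the resulting positional map is incomplete.

```python
from typing import List, Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]
# One observation per garment pixel: (x, y, depth, u, v), with u, v in [0, 1].
Observation = Tuple[float, float, float, float, float]


def backproject(x: float, y: float, d: float,
                fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Pinhole backprojection: pixel (x, y) with depth d -> (X, Y, Z).

    Assumption: a pinhole model with focal lengths (fx, fy) and
    principal point (cx, cy); these names are illustrative only.
    """
    return ((x - cx) * d / fx, (y - cy) * d / fy, d)


def partial_uv_map(pixels: Sequence[Observation], res: int,
                   intrinsics: Tuple[float, float, float, float]
                   ) -> List[List[Optional[Point3D]]]:
    """Scatter backprojected points into a res x res UV positional map.

    Texels without any observation remain None, so the map is partial.
    If several pixels land in the same texel, the last one wins.
    """
    fx, fy, cx, cy = intrinsics
    uv_map: List[List[Optional[Point3D]]] = [[None] * res for _ in range(res)]
    for x, y, d, u, v in pixels:
        col = min(int(u * res), res - 1)  # clamp u == 1.0 into the last column
        row = min(int(v * res), res - 1)  # clamp v == 1.0 into the last row
        uv_map[row][col] = backproject(x, y, d, fx, fy, cx, cy)
    return uv_map


# Toy input: two visible pixels on a 4 x 4 UV grid (made-up values).
observations = [(2.0, 2.0, 3.0, 0.0, 0.0), (4.0, 2.0, 2.0, 0.5, 0.75)]
uv_map = partial_uv_map(observations, res=4, intrinsics=(100.0, 100.0, 2.0, 2.0))
```

In this toy setup only two of the sixteen texels receive a 3D position. The paper fills the remaining texels by fitting the partial map to its learned garment prior, a step this sketch does not attempt.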