Table of Contents
Fetching ...

GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details

Zhongjin Luo, Haolin Liu, Chenghong Li, Wanghao Du, Zirong Jin, Wanhu Sun, Yinyu Nie, Weikai Chen, Xiaoguang Han

TL;DR

This work presents GarVerseLOD, a new dataset and framework that paves the way to achieving unprecedented robustness in high-fidelity 3D garment reconstruction from a single unconstrained image and proposes a novel labeling paradigm based on conditional diffusion models to generate extensive paired images for each garment model with high photorealism.

Abstract

Neural implicit functions have brought impressive advances to the state-of-the-art of clothed human digitization from multiple or even single images. However, despite the progress, current arts still have difficulty generalizing to unseen images with complex cloth deformation and body poses. In this work, we present GarVerseLOD, a new dataset and framework that paves the way to achieving unprecedented robustness in high-fidelity 3D garment reconstruction from a single unconstrained image. Inspired by the recent success of large generative models, we believe that one key to addressing the generalization challenge lies in the quantity and quality of 3D garment data. Towards this end, GarVerseLOD collects 6,000 high-quality cloth models with fine-grained geometry details manually created by professional artists. In addition to the scale of training data, we observe that having disentangled granularities of geometry can play an important role in boosting the generalization capability and inference accuracy of the learned model. We hence craft GarVerseLOD as a hierarchical dataset with levels of details (LOD), spanning from detail-free stylized shape to pose-blended garment with pixel-aligned details. This allows us to make this highly under-constrained problem tractable by factorizing the inference into easier tasks, each narrowed down with smaller searching space. To ensure GarVerseLOD can generalize well to in-the-wild images, we propose a novel labeling paradigm based on conditional diffusion models to generate extensive paired images for each garment model with high photorealism. We evaluate our method on a massive amount of in-the-wild images. Experimental results demonstrate that GarVerseLOD can generate standalone garment pieces with significantly better quality than prior approaches. Project page: https://garverselod.github.io/

GarVerseLOD: High-Fidelity 3D Garment Reconstruction from a Single In-the-Wild Image using a Dataset with Levels of Details

TL;DR

This work presents GarVerseLOD, a new dataset and framework that paves the way to achieving unprecedented robustness in high-fidelity 3D garment reconstruction from a single unconstrained image and proposes a novel labeling paradigm based on conditional diffusion models to generate extensive paired images for each garment model with high photorealism.

Abstract

Neural implicit functions have brought impressive advances to the state-of-the-art of clothed human digitization from multiple or even single images. However, despite the progress, current arts still have difficulty generalizing to unseen images with complex cloth deformation and body poses. In this work, we present GarVerseLOD, a new dataset and framework that paves the way to achieving unprecedented robustness in high-fidelity 3D garment reconstruction from a single unconstrained image. Inspired by the recent success of large generative models, we believe that one key to addressing the generalization challenge lies in the quantity and quality of 3D garment data. Towards this end, GarVerseLOD collects 6,000 high-quality cloth models with fine-grained geometry details manually created by professional artists. In addition to the scale of training data, we observe that having disentangled granularities of geometry can play an important role in boosting the generalization capability and inference accuracy of the learned model. We hence craft GarVerseLOD as a hierarchical dataset with levels of details (LOD), spanning from detail-free stylized shape to pose-blended garment with pixel-aligned details. This allows us to make this highly under-constrained problem tractable by factorizing the inference into easier tasks, each narrowed down with smaller searching space. To ensure GarVerseLOD can generalize well to in-the-wild images, we propose a novel labeling paradigm based on conditional diffusion models to generate extensive paired images for each garment model with high photorealism. We evaluate our method on a massive amount of in-the-wild images. Experimental results demonstrate that GarVerseLOD can generate standalone garment pieces with significantly better quality than prior approaches. Project page: https://garverselod.github.io/

Paper Structure

This paper contains 36 sections, 17 equations, 20 figures, 5 tables.

Figures (20)

  • Figure 1: The pipeline of our novel strategy for constructing a progressive garment dataset with levels of details. (a) Each case shows the reference image and the artist-crafted T-pose coarse garment in Garment Style Database. (b) A example of the reference image and the artist-crafted detail-pair in Local Detail Database. (c) A example of the reference image and the artist-crafted deformation-pair in Garment Deformation Database. (d) To obtain an T-pose garment with geometric details, we first sample a shape $M_C$ from the Garment Style Database and a "Local Detail Pair" ($L_C$, $L_F$) from the Local Detail Database. Then we transfer the geometric details depicted by ($L_C$, $L_F$) to $M_C$ to obtain $M_L$. (e) The deformation depicted by a sampled "Garment Deformation Pair" ($D_T$, $D_F$) is transferred to $M_L$ to obtain the fine garment $M_D$, which contains fine-grained geometric details and complex deformations (Fine Garment Dataset). Original images courtesy of licensed photos.
  • Figure 2: Left: Our novel strategy for generating extensive photorealistic paired images. We acquire rendered images of 3D garments with random camera views. These rendered images are processed through Canny-Conditional Stable Diffusion rombach2022highmou2023t2izhang2023adding to produce photorealistic images. Right: (a) The garment sampled from Fine Garment Dataset; (b) The synthesized image; (c) The pixel-aligned mask; (d) The normal map rendered using (a); (e) The garment mask rendered by (a); (f) The counterpart T-pose coarse garment of (a). In Sec. \ref{['sec:method']}, (b, f) is used to train the coarse garment estimator, while (b,c,d) is adopted to train the normal estimator. (d, e, a) is utilized to train the fine garment estimator and the geometry-aware boundary predictor. Synthesized images courtesy of Stable Diffusion.
  • Figure 3: The pipeline of our proposed method. Given an RGB image, our method first estimates the T-pose garment shape $G({{\alpha}})$ (Eq. \ref{['math:G']}) and computes its pose-related deformation $M_P(\alpha, \beta, \theta)$ with the help of the predicted SMPL body (Eq. \ref{['math:TG']}, Eq. \ref{['math:MP']}). Then a pixel-aligned network is used to reconstruct implicit fine garment $M_I$ and the geometry-aware boundary estimator is adopted to predict the garment boundary. Finally, we register $M_P(\cdot)$ to $M_I$ to obtain the final mesh $M_F$, which has fine topology and open-boundaries. Images courtesy of Stable Diffusion.
  • Figure 4: Result gallery of our method. Each image is followed by the reconstructed garment mesh. As illustrated, our method can effectively reconstruct garments with intricate deformations and fine-grained surface details. To support the modeling of folded structures, such as collars, we assembled a repository of diverse real-world collars that were crafted based on our topologically-consistent garments. A lightweight classification network was trained to select the collar that best matches the given image in terms of appearance zhu2022registering. Original images courtesy of licensed photos and Stable Diffusion. The images with a gray background are synthesized, while the rest are licensed photos.
  • Figure 5: Qualitative comparison between ours and the state of the arts. For each row, the input image is followed by the results generated by BCNet jiang2020bcnet, ClothWild moon20223d, Deep Fashion3D zhu2020deep, ReEF zhu2022registering and our method. Input images courtesy of Stable Diffusion.
  • ...and 15 more figures