A Generative Multi-Resolution Pyramid and Normal-Conditioning 3D Cloth Draping

Hunor Laczkó, Meysam Madadi, Sergio Escalera, Jordi Gonzalez

TL;DR

This work addresses 3D garment generation and draping by introducing a pyramid conditional variational autoencoder that operates in a canonical pose space and uses UV map representations. It conditions fabric generation on garment templates, posed body UVs, and surface normals, with a dedicated normal-encoder to enable sampling. The pyramid architecture progressively adds low- to high-frequency details across multiple resolutions, yielding state-of-the-art results on CLOTH3D and CAPE while maintaining generalization to unseen garments and poses even with limited data. The approach achieves fast inference and demonstrates robust handling of detail, texture, and geometric consistency, offering a practical solution for generative 3D cloth draping and virtual try-on applications.

Abstract

RGB cloth generation has been studied extensively in the literature; however, 3D garment generation remains an open problem. In this paper, we build a conditional variational autoencoder for 3D garment generation and draping. We propose a pyramid network that adds garment details progressively in a canonical space, i.e., by unposing and unshaping the garments w.r.t. the body. We study conditioning the network on surface normal UV maps as an intermediate representation, which is an easier problem to optimize than 3D coordinates. Our results on two public datasets, CLOTH3D and CAPE, show that our model is robust, is controllable in terms of detail generation through the use of multi-resolution pyramids, and achieves state-of-the-art results that generalize well to unseen garments, poses, and shapes even when trained on small amounts of data.

Paper Structure (33 sections, 5 equations, 13 figures, 6 tables)

Figures (13)

  • Figure 1: The proposed pyramid pipeline (right) contains basic VAE modules for each draping level (VAE$_{drape}$, left). VAE$_{drape}$ receives conditioning inputs and garment offsets and reconstructs the unposed and unshaped garment offsets as a UV image. At the first level, absolute coordinates are used instead of offsets (as shown on the left), since this level serves as the base for subsequent levels. The conditioning variables (normals, posed body, and garment template UV images) are fed into three pre-trained, frozen encoders whose outputs are fused with the VAE$_{drape}$ latent code. These conditioning encoders are trained separately in an autoencoder manner (note that the normals encoder is trained through VAE$_{norm}$). Finally, the reconstructed UV image is converted to a mesh and passed to the skinning module after reshaping. In the pyramid module, the lowest-resolution level predicts the low-frequency garment, while the other levels are learned as offsets over the previous level. Each level's output is upscaled with the proposed upscaling network and summed into the next level. At inference time, we sample from VAE$_{norm}$ and VAE$_{drape}$ and pass in the template garment and posed body UV images.
  • Figure 2: Examples of preprocessed data. a) Cloth unposing and unshaping. b) 3D mesh to UV map. c) Surface normals calculation. d) UV image down/upscaling.
  • Figure 3: Effectiveness of the pyramid architecture on three examples. Each example shows the ground truth, the baseline, and the last output of the incremental method, followed by the output of the final pyramid after each level, after skinning and shaping. Notice that each subsequent level adds details to the previous output.
  • Figure 4: Sampling capabilities of the model. (i) ground truth, (ii-v) results by sampling the VAE latent spaces. Notice the produced changes in the blue rectangles. The shown examples have an average area difference of less than 8% compared to the ground truth.
  • Figure 5: Qualitative comparison of ground truth (i) with predictions from DeePSD (ii), HOOD (iii), and our method (iv). HOOD improves on DeePSD in the level of detail but over-stretches the garment. Our method offers a high level of detail with minimal stretching and compression relative to the template shape.
  • ...and 8 more figures
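
The coarse-to-fine composition described in the Figure 1 caption (a low-resolution base level, with finer levels stored as offsets that are upscaled and summed) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, nearest-neighbour upsampling stands in for the paper's learned upscaling network, average pooling stands in for its downscaling, and the offsets are computed directly from a random UV image rather than predicted by VAE$_{drape}$.

```python
import numpy as np

def downscale(uv, factor=2):
    """Average-pool an (H, W, C) UV image by an integer factor (assumed scheme)."""
    h, w, c = uv.shape
    return uv.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def upscale(uv, factor=2):
    """Nearest-neighbour upsampling; a stand-in for the learned upscaling network."""
    return uv.repeat(factor, axis=0).repeat(factor, axis=1)

def build_pyramid(uv, levels=3):
    """Split a UV image into [base, offset_1, ..., offset_{levels-1}], coarse to fine."""
    scales = [uv]
    for _ in range(levels - 1):
        scales.append(downscale(scales[-1]))
    scales.reverse()  # coarsest resolution first
    pyramid = [scales[0]]  # the base level holds absolute coordinates
    for coarse, fine in zip(scales[:-1], scales[1:]):
        pyramid.append(fine - upscale(coarse))  # finer levels hold offsets only
    return pyramid

def compose(pyramid):
    """Upscale each level's output and sum it with the next level's offsets."""
    out = pyramid[0]
    for offset in pyramid[1:]:
        out = upscale(out) + offset
    return out

# Toy 32x32 UV image with 3 channels (x, y, z coordinates per texel).
rng = np.random.default_rng(0)
uv_map = rng.normal(size=(32, 32, 3))
pyr = build_pyramid(uv_map, levels=3)
recon = compose(pyr)
```

With these stand-ins, composing all levels recovers the original UV image, and truncating the list passed to `compose` gives progressively smoother versions, which mirrors the idea of controlling the amount of detail by level.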