SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes

Soubhik Sanyal, Partha Ghosh, Jinlong Yang, Michael J. Black, Justus Thies, Timo Bolkart

TL;DR

SCULPT addresses the challenge of generating realistic, clothed, textured 3D human meshes from unpaired data by decoupling geometry and appearance learning: geometry is learned from 3D scan data as pose-dependent UV-space displacements on the SMPL template, while texture is learned from large-scale 2D fashion images using a geometry-conditioned texture generator. The model comprises two StyleGAN3-based networks, G_geo for geometry and G_tex for texture, with clothing-type and color labels derived automatically via BLIP and CLIP to reduce pose-clothing entanglement, and G_tex further conditioned on intermediate features from G_geo to enforce geometry–texture coherence. Training proceeds in two stages with a dual-discriminator setup and differentiable rendering, producing explicit meshes and texture maps that are articulation-ready and suitable for standard graphics pipelines. Compared to state-of-the-art 3D clothed-human methods, SCULPT offers improved geometry quality, richer texture details, and explicit semantic controls over clothing type and color, enabling flexible, render-ready 3D human generation for games, films, and synthetic data creation.

Abstract

We present SCULPT, a novel 3D generative model for clothed and textured 3D meshes of humans. Specifically, we devise a deep neural network that learns to represent the geometry and appearance distribution of clothed human bodies. Training such a model is challenging, as datasets of textured 3D meshes for humans are limited in size and accessibility. Our key observation is that there exist medium-sized 3D scan datasets like CAPE, as well as large-scale 2D image datasets of clothed humans, and that multiple appearances can be mapped to a single geometry. To effectively learn from the two data modalities, we propose an unpaired learning procedure for pose-dependent clothed and textured human meshes. Specifically, we learn a pose-dependent geometry space from 3D scan data. We represent this as per-vertex displacements w.r.t. the SMPL model. Next, we train a geometry-conditioned texture generator in an unsupervised way using the 2D image data. We use intermediate activations of the learned geometry model to condition our texture generator. To alleviate entanglement between pose and clothing type, and between pose and clothing appearance, we condition both generators on attribute labels: clothing types for the geometry generator and clothing colors for the texture generator. We automatically generate these conditioning labels for the 2D images using the visual question answering model BLIP and CLIP. We validate our method on the SCULPT dataset, and compare to state-of-the-art 3D generative models for clothed human bodies. Our code and data can be found at https://sculpt.is.tue.mpg.de.
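
The geometry representation described above (per-vertex displacements on SMPL, stored in UV space) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the bilinear sampling scheme, the 256x256 map resolution, and the random stand-in data are all assumptions made for illustration. The only figure taken from SMPL itself is its vertex count of 6890. Whether SCULPT adds the offsets before or after posing is not stated here, so the sketch simply adds them to a template.

```python
import numpy as np


def sample_uv_map(uv_map, uv_coords):
    """Bilinearly sample an (H, W, C) map at (N, 2) UV coordinates in [0, 1]."""
    height, width, _ = uv_map.shape
    # Map normalized UV coordinates to continuous pixel positions.
    x = uv_coords[:, 0] * (width - 1)
    y = uv_coords[:, 1] * (height - 1)
    # Integer corners of the pixel cell around each sample, clamped to the map.
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, width - 1)
    y1 = np.minimum(y0 + 1, height - 1)
    # Fractional offsets inside the cell, shaped (N, 1) to broadcast over channels.
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    top = (1.0 - wx) * uv_map[y0, x0] + wx * uv_map[y0, x1]
    bottom = (1.0 - wx) * uv_map[y1, x0] + wx * uv_map[y1, x1]
    return (1.0 - wy) * top + wy * bottom


def apply_displacements(template_vertices, vertex_uvs, displacement_map):
    """Add the 3D offset sampled at each vertex's UV location to that vertex."""
    offsets = sample_uv_map(displacement_map, vertex_uvs)
    return template_vertices + offsets


# Random stand-ins (assumptions): a real pipeline would load the SMPL template,
# its UV layout, and a displacement map produced by the geometry generator.
rng = np.random.default_rng(0)
num_vertices = 6890  # vertex count of the SMPL template mesh
template = rng.normal(size=(num_vertices, 3))
vertex_uvs = rng.uniform(size=(num_vertices, 2))
displacement_map = 0.01 * rng.normal(size=(256, 256, 3))

clothed = apply_displacements(template, vertex_uvs, displacement_map)
```

Because the output keeps the template's vertex order and count, the displaced mesh stays compatible with the body model's skinning, which is what makes an explicit mesh of this kind straightforward to articulate.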

Paper Structure (19 sections, 5 equations, 13 figures, 2 tables)

Figures (13)

  • Figure 1: SCULPT is a generative model of geometry and appearance of clothed human meshes. The generated textured clothing mesh can be readily inserted into 3D scenes. In the above figure, we generated the clothed humans and placed them on a 3D floor. The scene has a single camera with global directional light settings.
  • Figure 2: Overview: SCULPT consists of two StyleGAN-based generators for geometry ($\mathcal{G}_{geo}$) and appearance ($\mathcal{G}_{tex}$), both acting in the UV space of the SMPL body model. The geometry network $\mathcal{G}_{geo}$ outputs pose-dependent displacement maps that are added to the SMPL template mesh and is trained using 3D scan data. Based on this model, the appearance generator $\mathcal{G}_{tex}$ is trained in an unsupervised way using adversarial losses computed on rendered images of the generated synthetic human. It is conditioned on intermediate features of the geometry network. Besides the noise code, both generator networks receive additional attributes for appearance ($\mathbf{c}_{t}$) and clothing type ($\mathbf{c}_{g}$) as input. This enhances the connection between appearance and geometry, and it offers a user-friendly control over the generation.
  • Figure 3: Pose control: The figure presents three meshes of an individual. The initial two meshes depict the individual in different types of clothing (long-long and short-short) but maintaining the same pose, while the third mesh illustrates the individual in a different pose, wearing the same type of clothing as depicted in the first mesh. The pose-dependent clothing deformations are visible in the zoomed-in images on the side, which are produced by the geometry generator. It is observed that identical poses result in similar deformations, as indicated by the color-coded bounding boxes, while a change in pose leads to distinct clothing deformations in the geometry.
  • Figure 4: Pose control: Varying pose while keeping other factors fixed. Each row contains two different identities, each in two poses. Texture and geometry meshes are shown side-by-side.
  • Figure 5: Cloth-type control: Varying $\mathbf{c}_{g}$ while keeping other factors fixed. Each row contains two separate identities. For each identity, we show two different clothing types consecutively. Texture and geometry meshes are shown side-by-side.
  • ...and 8 more figures
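
The attribute conditioning described in the Figure 2 caption (a noise code plus a clothing-type label $\mathbf{c}_{g}$ or an appearance label $\mathbf{c}_{t}$) can be sketched as follows. This is only one simple way to combine the two inputs and is an assumption, not the paper's mechanism: the labels are one-hot encoded and concatenated with the noise vector. The clothing types are the two named in the Figure 3 caption; the color vocabulary, the 64-dimensional noise, and all function names are hypothetical. The conditioning of the texture generator on intermediate geometry features is not modeled.

```python
import numpy as np

# The two clothing types named in the Figure 3 caption; the real label set may be larger.
CLOTHING_TYPES = ["long-long", "short-short"]
# Hypothetical color vocabulary, for illustration only.
CLOTHING_COLORS = ["black", "white", "red", "blue"]


def one_hot(label, vocabulary):
    """Encode a label as a one-hot float vector over the given vocabulary."""
    vector = np.zeros(len(vocabulary), dtype=np.float32)
    vector[vocabulary.index(label)] = 1.0
    return vector


def conditioned_input(noise, label, vocabulary):
    """Concatenate a noise code with the one-hot encoding of an attribute label."""
    return np.concatenate([noise, one_hot(label, vocabulary)])


rng = np.random.default_rng(0)
z_geo = rng.normal(size=64).astype(np.float32)  # noise code for the geometry generator
z_tex = rng.normal(size=64).astype(np.float32)  # noise code for the texture generator

geo_input = conditioned_input(z_geo, "short-short", CLOTHING_TYPES)  # plays the role of c_g
tex_input = conditioned_input(z_tex, "blue", CLOTHING_COLORS)        # plays the role of c_t
```

Keeping the label separate from the noise code is what gives the user-facing control mentioned in the caption: the label can be swapped while the noise vector stays fixed.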