PocoLoco: A Point Cloud Diffusion Model of Human Shape in Loose Clothing

Siddharth Seth, Rishabh Dabral, Diogo Luvizon, Marc Habermann, Ming-Hsuan Yang, Christian Theobalt, Adam Kortylewski

TL;DR

PocoLoco is the first template-free, point-based, pose-conditioned generative model for 3D humans in loose clothing. It operates directly on unordered point clouds, eliminating the need for a parametric model or a clothing template.

Abstract

Modeling a human avatar that can plausibly deform to articulations is an active area of research. We present PocoLoco -- the first template-free, point-based, pose-conditioned generative model for 3D humans in loose clothing. We motivate our work by noting that most methods require a parametric model of the human body to ground pose-dependent deformations. Consequently, they are restricted to modeling clothing that is topologically similar to the naked body and do not extend well to loose clothing. The few methods that attempt to model loose clothing typically require either canonicalization or a UV-parameterization and need to address the challenging problem of explicitly estimating correspondences for the deforming clothes. In this work, we formulate avatar clothing deformation as a conditional point-cloud generation task within the denoising diffusion framework. Crucially, our framework operates directly on unordered point clouds, eliminating the need for a parametric model or a clothing template. This also enables a variety of practical applications, such as point-cloud completion and pose-based editing -- important features for virtual human animation. As current datasets for human avatars in loose clothing are far too small for training diffusion models, we release a dataset of two subjects performing various poses in loose clothing with a total of 75K point clouds. By contributing towards tackling the challenging task of effectively modeling loose clothing and expanding the available data for training these models, we aim to set the stage for further innovation in digital humans. The source code is available at https://github.com/sidsunny/pocoloco .

Paper Structure

This paper contains 23 sections, 7 equations, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Learning avatars in loose clothing with PocoLoco (Ours). Existing approaches, such as POP, rely on a parametric model to simulate clothing deformations. While these methods demonstrate promising results with body-fitted tight clothing, they often exhibit artifacts when modeling loose clothing, which differs topologically from the body shape. This issue is evident in the sparse points generated within the loose clothing regions (55% vs 25% vertices). Additionally, these approaches require fitting a template to the input scan, which can be cumbersome and, from an artist's perspective, sometimes undesirable. In contrast, our learning-based approach, PocoLoco, can model pose-dependent loose clothing deformations without requiring an underlying parametric body model, clothed templates, or complex linear-blend skinning.
  • Figure 1: Qualitative comparison to POP showing results for loose clothing on unseen poses. We show both point clouds and their meshified versions for depicting point density and clothing deformation respectively. We additionally show the percentage of vertices occupying the loose clothing region (skirt). Due to modeling the clothing on top of a template model such as SMPL, the points from POP in the skirt region are too sparse to model any significant deformations. This is due to the points having a hard association with the nearest body part. Our method produces points much more consistently distributed across the body and clothing, thereby exhibiting realistic pose-dependent clothing deformations. Zoomed-in regions emphasize the most significant clothing deformations.
  • Figure 2: Method overview. We visualize the diffusion process (from right to left), which incrementally adds noise $\epsilon_t$ at every diffusion step $t$ to the point cloud of a human in loose clothing. During the reverse (generative) process, Gaussian noise $X_T$ is sampled and the noise is progressively removed by predicting a residual noise $\hat{\epsilon}_t=f_\theta(X_t)$. The diffusion model $f_\theta$ is composed of cross- and self-attention layers with query, key, and value tokens and a multi-layer perceptron (MLP). The skeletal pose-conditioning is applied in the cross-attention layer.
  • Figure 3: Comparison of our $\beta$ varying schedule (top) vs the standard linear schedule (bottom) for the first 300 (of 1000) steps in the diffusion process. The human shape is preserved longer throughout the diffusion process, which facilitates the learning process and significantly improves the performance (see the ablation section).
  • Figure 3: Qualitative comparison on the top 10% most difficult poses for the loose-clothing subject of the DynaCap dataset. POP obtains 7.2 cm CD while PocoLoco obtains 5.5 cm CD.
  • ...and 11 more figures
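
The forward noising step described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the standard DDPM closed form together with the standard linear schedule that the first Figure 3 caption uses as its baseline (the paper's own varying $\beta$ schedule is not specified in this listing). The function names, the schedule endpoints, and the random stand-in point cloud are all hypothetical.

```python
import numpy as np


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Standard linear schedule (assumed endpoints, not taken from the paper)."""
    return np.linspace(beta_start, beta_end, num_steps)


def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample X_t from q(X_t | X_0) in closed form and return (X_t, eps)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps


def predict_x0(x_t, eps_hat, t, alpha_bar):
    """Invert the closed form given a noise estimate eps_hat."""
    return (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])


rng = np.random.default_rng(0)
betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_s)

# Stand-in for an unordered point cloud of N points in 3D (hypothetical data).
x0 = rng.uniform(-1.0, 1.0, size=(2048, 3))

t = 299  # step 300 of 1000, the range visualized in the schedule comparison
x_t, eps = forward_diffuse(x0, t, alpha_bar, rng)

# With a perfect noise prediction, the clean point cloud is recovered exactly.
x0_rec = predict_x0(x_t, eps, t, alpha_bar)

# Fraction of the original coordinates that survives in X_t at this step.
signal_kept = np.sqrt(alpha_bar[t])
print(f"signal coefficient at step {t + 1}: {signal_kept:.3f}")
```

In a trained model, the true noise `eps` passed to `predict_x0` would be replaced by the network output $\hat{\epsilon}_t=f_\theta(X_t)$ conditioned on the skeletal pose. A schedule that keeps `signal_kept` higher over the early steps is one way to read the caption's remark that the human shape is preserved longer.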