Disentangled Clothed Avatar Generation from Text Descriptions

Jionghao Wang; Yuan Liu; Zhiyang Dou; Zhengming Yu; Yongqing Liang; Cheng Lin; Xin Li; Wenping Wang; Rong Xie; Li Song

TL;DR

A novel text-to-avatar generation method that generates the human body and the clothes separately, which enables high-quality animation of the generated avatar and significantly improves the visual quality of character animation, virtual try-on, and avatar editing.

Abstract

In this paper, we introduce a novel text-to-avatar generation method that generates the human body and the clothes separately and allows high-quality animation of the generated avatar. While recent advancements in text-to-avatar generation have yielded diverse human avatars from text prompts, these methods typically combine all elements (clothes, hair, and body) into a single 3D representation. Such an entangled approach poses challenges for downstream tasks like editing or animation. To overcome these limitations, we propose a novel disentangled 3D avatar representation named Sequentially Offset-SMPL (SO-SMPL), building upon the SMPL model. SO-SMPL represents the human body and clothes with two separate meshes but associates them with offsets to ensure the physical alignment between the body and the clothes. Then, we design a Score Distillation Sampling (SDS)-based distillation framework to generate the proposed SO-SMPL representation from text prompts. Our approach not only achieves higher texture and geometry quality and better semantic alignment with text prompts, but also significantly improves the visual quality of character animation, virtual try-on, and avatar editing. Project page: https://shanemankiw.github.io/SO-SMPL/.
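
For readers unfamiliar with Score Distillation Sampling, the gradient it supplies to the 3D parameters is commonly written in the following generic form. This is the standard formulation from the text-to-3D literature, not necessarily the exact loss used in this paper. All symbols here are assumptions introduced for illustration: $\theta$ denotes the optimized avatar parameters, $\mathbf{x}$ the rendered image, $\mathbf{x}_t$ its noised version at timestep $t$, $y$ the text prompt, $\hat{\epsilon}_{\phi}$ the pretrained diffusion model's noise prediction, and $w(t)$ a timestep weighting.

$$
\nabla_{\theta}\mathcal{L}_{\text{SDS}} = \mathbb{E}_{t,\epsilon}\left[\, w(t)\,\bigl(\hat{\epsilon}_{\phi}(\mathbf{x}_t;\, y,\, t) - \epsilon\bigr)\,\frac{\partial \mathbf{x}}{\partial \theta} \,\right]
$$

Intuitively, the renderer is nudged so that its noised output looks more like what the diffusion model expects for the prompt, without backpropagating through the diffusion model itself.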

Paper Structure (28 sections, 13 equations, 20 figures, 3 tables)


Introduction
Related work
Preliminaries
Methodology
SO-SMPL Representation
SO-SMPL Rendering
SO-SMPL Generation
Experiments
Generated Avatars & Clothes
Quality Comparisons
Quantitative Comparisons
User Study
Ablation Study
Applications
Limitations & Conclusion
...and 13 more sections

Figures (20)

Figure 1: Our method generates high-quality separated human body and clothes meshes from text prompts. Kinematics or simulation motions can drive the disentangled avatar representations to achieve photorealistic animations.
Figure 2: An overview of our generation pipeline. Our pipeline has two stages. In Stage I, we generate a base human body model by optimizing its shape parameter $\beta$, vertex offset $\mathbf{O}_{\text{h}}$, and albedo texture $\Gamma_{\text{h}}$. A ControlNet is used to compute a Score Distillation Sampling (SDS) loss conditioned on an "A"-pose map $\mathcal{P}$ for the rendered RGB image $\mathbf{I}_{\text{h}}$ and normal map $\mathbf{N}_{\text{h}}$. In Stage II, we freeze the human body model and optimize the clothes parameters $\mathbf{O}_{\text{c}}$ along with the albedo texture MLP $\Gamma_{\text{c}}$. The rendered RGB images and normal maps of both the clothed human and the clothes are used in computing the SDS losses. In both stages, we use a simple Phong shading model to render images from our SO-SMPL representations.
Figure 3: Illustration of our SO-SMPL representation. Two vertex-wise offsets, namely the human offset $\mathbf{O}_{\text{h}}$ and the clothes offset $\mathbf{O}_{\text{c}}$, are added sequentially to the SMPL-X body mesh $\mathbf{T}(\beta)$ to obtain the human body mesh $\mathbf{T}_{\text{h}}(\beta, \mathbf{O}_{\text{h}})$ and the clothed human mesh $\mathbf{T}_{\text{c+h}}(\beta, \mathbf{O}_{\text{h}}, \mathbf{O}_{\text{c}}, \mathbf{M}_{\text{c}})$, where a vertex mask $\mathbf{M}_{\text{c}}$ is calculated to determine the clothing region. Finally, we mask the clothed human mesh with $\mathbf{M}_{\text{c}}$ and obtain the separated clothes mesh $\mathbf{T}_{\text{c}}(\beta, \mathbf{O}_{\text{h}},\mathbf{O}_{\text{c}},\mathbf{M}_{\text{c}})$.
Figure 4: An intuitive illustration of the impact of the baked-in artifacts in learned albedo. On the left, the generated pants suffer from severe baked-in wrinkles in their texture, resulting in non-photorealistic wrinkles and shadows in animation. On the right, our shader prevents the shadows from baking into the albedo, hence significantly improving the visual quality of animations.
Figure 5: A gallery of our generated clothed avatars. Our method can generate avatars with varied ethnicities, genders, and clothes.
...and 15 more figures
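
The sequential-offset construction described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `so_smpl_meshes` and its arguments are hypothetical, the shaped template $\mathbf{T}(\beta)$ is passed in as a plain vertex array rather than computed by an SMPL-X model, the clothes offset is assumed to apply only inside the mask, and face connectivity is ignored.

```python
import numpy as np


def so_smpl_meshes(template, offset_h, offset_c, mask_c):
    """Compose body, clothed-body, and clothes vertices (hypothetical helper).

    template : (V, 3) vertices of the shaped body mesh T(beta)
    offset_h : (V, 3) per-vertex human offset O_h
    offset_c : (V, 3) per-vertex clothes offset O_c
    mask_c   : (V,)   boolean clothing mask M_c
    """
    # Step 1: add the human offset to obtain the body mesh T_h.
    body = template + offset_h
    # Step 2: add the clothes offset on top of the body.
    # Assumption: the offset only takes effect inside the clothing mask.
    clothed = body + mask_c[:, None] * offset_c
    # Step 3: keep only the masked vertices to obtain the clothes mesh T_c.
    # A full mesh would also need its faces filtered; that is omitted here.
    clothes = clothed[mask_c]
    return body, clothed, clothes


# Toy usage with four vertices, two of which are covered by clothing.
template = np.zeros((4, 3))
offset_h = np.full((4, 3), 0.1)
offset_c = np.full((4, 3), 0.02)
mask_c = np.array([True, True, False, False])
body, clothed, clothes = so_smpl_meshes(template, offset_h, offset_c, mask_c)
```

Because the clothes vertices are defined as an offset from the body vertices, the two meshes stay aligned by construction, which is the property the representation relies on for try-on and editing.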
