HumanCoser: Layered 3D Human Generation via Semantic-Aware Diffusion Model

Yi Wang, Jian Ma, Ruizhi Shao, Qiao Feng, Yu-kun Lai, Kun Li

TL;DR

This work addresses the challenge of generating editable, physically-layered 3D humans from text prompts by introducing a physically-decoupled diffusion framework. It uses a two-stage pipeline: (i) canonical body generation via NeRF controlled by SMPL and ControlNet, and (ii) dual-representation decoupling (DRD) for clothing with multi-layer fusion rendering, plus an SMPL-powered implicit deformation network (SID Net) to align garments with varying body shapes. Key contributions include the DRD for disentangled clothing semantics, a multi-layer rendering strategy for layering, and the SID Net for accurate garment–body matching, enabling clothing transfer across identities and pose-driven animation. The approach achieves state-of-the-art layered 3D human generation, supports virtual try-on and layered animation, and offers practical benefits for digital avatars, fashion, and entertainment, with future work on more precise deformation proxies and collision-aware optimization. The technical core relies on loss terms such as $L_{body}$, $L_{SDS}$, $L_n$, $L_n^{reg}$, $\mathcal{L}_{reg\_ds}$, $\mathcal{L}_r$, $\mathcal{L}_{match}$, and $\mathcal{L}_{offset}^{reg}$ to drive accurate, decoupled garment generation and fitting.
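
For orientation, $L_{SDS}$ most likely denotes a score distillation sampling loss. This summary does not give the paper's exact formulation, so the expression below is only the standard form from the text-to-3D literature; the symbols $\theta$, $g$, $\phi$, $w(t)$, and $y$ are assumptions introduced here, and the paper's version may add conditioning (for example through ControlNet):

$$
\nabla_\theta L_{SDS} = \mathbb{E}_{t,\epsilon}\left[\, w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \,\right], \qquad x = g(\theta)
$$

Here $x$ is an image rendered from the 3D representation with parameters $\theta$, $x_t$ is that image after adding noise $\epsilon$ at timestep $t$, $\hat{\epsilon}_\phi$ is the noise predicted by a pretrained text-conditioned diffusion model for prompt $y$, and $w(t)$ is a timestep-dependent weight.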

Abstract

This paper aims to generate physically-layered 3D humans from text prompts. Existing methods either generate 3D clothed humans as a whole or support only tight and simple clothing generation, which limits their applications to virtual try-on and part-level editing. To achieve physically-layered 3D human generation with reusable and complex clothing, we propose a novel layer-wise dressed human representation based on a physically-decoupled diffusion model. Specifically, to achieve layer-wise clothing generation, we propose a dual-representation decoupling framework for generating clothing decoupled from the human body, in conjunction with an innovative multi-layer fusion volume rendering method. To match the clothing with different body shapes, we propose an SMPL-driven implicit field deformation network that enables the free transfer and reuse of clothing. Extensive experiments demonstrate that our approach not only achieves state-of-the-art layered 3D human generation with complex clothing but also supports virtual try-on and layered human animation.
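
The abstract does not spell out how the multi-layer fusion volume rendering combines the body and clothing fields. As a rough illustration only, the sketch below uses one common way to composite several radiance fields along a ray: sum the per-layer densities at shared sample points, mix colors by each layer's density share, and apply standard NeRF-style alpha compositing. The function name `composite_layers` and its arguments are hypothetical, and this is not claimed to be the paper's method.

```python
import numpy as np


def composite_layers(sigmas, colors, deltas):
    """Composite K density/color fields sampled at the same N points on a ray.

    Assumption (not from the paper): layers are fused by summing densities
    and mixing colors in proportion to each layer's density.

    sigmas: (K, N) non-negative densities, one row per layer (e.g. body, garment)
    colors: (K, N, 3) RGB values per layer and sample
    deltas: (N,) distances between consecutive samples
    Returns (rgb, weights, layer_weights).
    """
    sigmas = np.asarray(sigmas, dtype=float)
    colors = np.asarray(colors, dtype=float)
    deltas = np.asarray(deltas, dtype=float)

    # Total density at each sample is the sum over layers.
    sigma_total = sigmas.sum(axis=0)                          # (N,)

    # Fraction of the density owned by each layer; the floor avoids 0/0 in empty space.
    share = sigmas / np.maximum(sigma_total, 1e-12)           # (K, N)
    color_mix = (share[..., None] * colors).sum(axis=0)       # (N, 3)

    # Standard volume rendering: opacity per sample and transmittance up to it.
    alpha = 1.0 - np.exp(-sigma_total * deltas)               # (N,)
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha                                   # (N,)

    rgb = (weights[:, None] * color_mix).sum(axis=0)          # (3,)
    layer_weights = share * weights                           # (K, N)
    return rgb, weights, layer_weights
```

Under this scheme a dense garment sample in front of the body absorbs almost all of the ray's weight, so the garment occludes the body without any explicit depth sorting, and `layer_weights` records how much each layer contributed to the pixel.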

Paper Structure

This paper contains 16 sections, 15 equations, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: Illustration of our framework for generating the clothes and body of a dressed human in a layered manner. (a) shows the generation of the minimized body, and (b) shows the layered generation of clothing and the matching of clothing with the body.
  • Figure 2: The decoupled generation of human body and clothing by our method. (a) clothing prompt: "A turquoise Cheongsam", (b) clothing prompt: "A deep-skyblue sleeveless sheath dress with lace trims", (c) clothing prompt: "A Duffle Coat and a baggy linen pants", (d) clothing prompt: "A Car Coat and a baggy jeans".
  • Figure 3: Quantitative results. Our method and the methods of [a23, a11, a46, a54] are evaluated using the method of [a38], which measures the visual quality of the generated 3D content; higher scores are better.
  • Figure 4: Qualitative comparison with the coupled generation methods [a23, a46, a1]. (a) prompt: "A north American Indian chief in full regalia", (b) prompt: "A Chinese lady wearing a gauzy hanfu", (c) prompt: "A Hawaiian woman wearing a hula skirt", (d) prompt: "A French woman wearing a light blue crinoline dress".
  • Figure 5: Qualitative comparison with the layered method [a48].
  • ...and 7 more figures