DAGSM: Disentangled Avatar Generation with GS-enhanced Mesh

Jingyu Zhuang, Di Kang, Linchao Bao, Liang Lin, Guanbin Li

TL;DR

DAGSM tackles the limitation of unified clothed-human models by learning disentangled avatars in which the body and each garment are separate GS-enhanced meshes (GSM), enabling clothing replacement and physics-based animation. It uses a sequential pipeline: first generating an unclothed body, then generating each garment as a mesh proxy with 2D Gaussians bound to it, and finally refining textures with a view-consistent framework that combines cross-view attention and incident-angle-weighted denoising. Key contributions include the GSM representation with per-Gaussian UV maps, SAM-based garment separation, and a novel view-consistent texture refinement that improves cross-view consistency and material fidelity. The approach yields high-quality, animatable avatars that support editing, texture control, and reference-image guided appearance, with strong empirical gains over baselines in visual quality and alignment to prompts.

Abstract

Text-driven avatar generation has gained significant attention owing to its convenience. However, existing methods typically model the human body with all garments as a single 3D model, which limits its usability for tasks such as clothing replacement and reduces user control over the generation process. To overcome these limitations, we propose DAGSM, a novel pipeline that generates disentangled human bodies and garments from the given text prompts. Specifically, we model each part (e.g., body, upper/lower clothes) of the clothed human as one GS-enhanced mesh (GSM), which is a traditional mesh with 2D Gaussians attached to better handle complicated textures (e.g., woolen, translucent clothes) and produce realistic cloth animations. During generation, we first create the unclothed body and then generate each garment in sequence on top of the body, where we introduce a semantic-based algorithm to achieve better human-cloth and garment-garment separation. To improve texture quality, we propose a view-consistent texture refinement module, including a cross-view attention mechanism for texture style consistency and an incident-angle-weighted denoising (IAW-DE) strategy to update the appearance. Extensive experiments demonstrate that DAGSM generates high-quality disentangled avatars, supports clothing replacement and realistic animation, and outperforms the baselines in visual quality.
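
To make the idea of a mesh with attached Gaussians concrete, the following is a minimal sketch, not the authors' implementation. It assumes each Gaussian is tied to one triangle through fixed barycentric coordinates, so its centre follows the triangle when the mesh is animated. The names `BoundGaussian` and `gaussian_center` are hypothetical, and the abstract does not state the actual binding rule.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Face = Tuple[int, int, int]


@dataclass
class BoundGaussian:
    """A Gaussian attached to one mesh triangle (illustrative assumption)."""
    face: int                          # index of the triangle it is bound to
    bary: Tuple[float, float, float]   # barycentric coordinates, assumed to sum to 1


def gaussian_center(g: BoundGaussian, vertices: List[Vec3], faces: List[Face]) -> Vec3:
    """Recompute the Gaussian centre from the current vertex positions."""
    i, j, k = faces[g.face]
    a, b, c = g.bary
    return tuple(
        a * vertices[i][d] + b * vertices[j][d] + c * vertices[k][d]
        for d in range(3)
    )


# One triangle in the z = 0 plane, then the same triangle lifted to z = 1.
faces = [(0, 1, 2)]
rest_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
moved_pose = [(x, y, z + 1.0) for (x, y, z) in rest_pose]

g = BoundGaussian(face=0, bary=(0.5, 0.25, 0.25))
center_rest = gaussian_center(g, rest_pose, faces)
center_moved = gaussian_center(g, moved_pose, faces)
```

Under this assumption, deforming the mesh (for example with a cloth simulator) moves the Gaussians with it, without re-optimising their positions.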

Paper Structure

This paper contains 25 sections, 14 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: Given text prompts, our method DAGSM allows users to generate disentangled avatars in diverse styles (e.g., real, cartoon) with various garments. Because the human body and garments are generated separately, the method naturally supports clothing replacement. We represent every single part (e.g., body, upper/lower clothes) using a hybrid GS-enhanced mesh, where the 2D Gaussians are attached to a proxy mesh to better handle complicated cloth texture (e.g., cotton, woolen, and transparent fabric in row 1, left 2) and produce realistic animations.
  • Figure 2: Method overview. Given text prompts, DAGSM generates disentangled digital humans whose bodies and clothes are represented as multiple individual GSMs (Sec. \ref{subsec:model}). The generation process includes three stages: 1) a body generation stage that generates an unclothed body using the SMPL-X human prior \cite{pavlakos2019expressive} under the guidance of the text-to-image model SD \cite{esser2024scaling} (Sec. \ref{subsec:body}); 2) a cloth generation stage that first creates the cloth's mesh proxy and then binds 2DGS $\mathcal{G}_{b}$ to the mesh to generate a textured garment (Sec. \ref{subsec:cloth}); and 3) a view-consistent refinement stage, where we propose a cross-view attention mechanism for texture style consistency and an incident-angle-weighted denoising (IAW-DE) strategy to enhance the appearance image $\hat{\mathcal{V}}_i$ (Sec. \ref{subsec:refinement}).
  • Figure 3: Visual comparisons. Results generated by our method have significantly higher visual quality, accurately follow the input text prompt, and can be naturally animated. In contrast, results from DreamWaltz are low-resolution and contain obvious structural problems. HumanGaussian produces unexpected content (e.g., a basketball) and shows issues such as a split skirt during animation. SO-SMPL cannot generate clothing beyond the body's topology (e.g., dresses), which restricts its applicability.
  • Figure 4: Ablation study on SAM-based filtering. The results demonstrate its effectiveness in filtering out the noise belonging to the body, achieving better body-garment separation.
  • Figure 5: Ablation study on view-consistent refinement. To better demonstrate the texture, we render a single layer of the white lace dress (i.e., the layer not occluded by the body) on a black background. Our view-consistent refinement (4) effectively improves the suboptimal textures generated by RFDS loss (1). This refinement includes two key components: cross-view attention, which ensures texture style consistency across views (2 vs. 4); and IAW-DE, which reduces the blurriness of the texture caused by view inconsistency (3 vs. 4).
  • ...and 5 more figures
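
The Figure 5 caption attributes texture blur to inconsistency between views, which incident-angle weighting is meant to reduce. The sketch below shows the general idea only, under stated assumptions: each view's contribution to a surface point is weighted by the cosine between the surface normal and the direction to the camera, so head-on views dominate grazing ones. The function names, the `power` parameter, and the normalised blend are hypothetical. The exact IAW-DE formulation is not given in this section.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vec3 = Sequence[float]
RGB = Tuple[float, float, float]


def incident_weight(normal: Vec3, to_camera: Vec3, power: float = 1.0) -> float:
    """Cosine of the incident angle, clamped to 0 for back-facing views."""
    dot = sum(n * v for n, v in zip(normal, to_camera))
    norm = math.sqrt(sum(n * n for n in normal)) * math.sqrt(sum(v * v for v in to_camera))
    if norm == 0.0:
        return 0.0
    return max(0.0, dot / norm) ** power


def blend_texel(normal: Vec3, observations: List[Tuple[Vec3, RGB]]) -> Optional[RGB]:
    """Weighted average of per-view colours; None if no view sees the point."""
    weights = [incident_weight(normal, direction) for direction, _ in observations]
    total = sum(weights)
    if total == 0.0:
        return None
    return tuple(
        sum(w * rgb[c] for w, (_, rgb) in zip(weights, observations)) / total
        for c in range(3)
    )


# A surface point facing +z, seen head-on (red) and at a grazing angle (green).
normal = (0.0, 0.0, 1.0)
observations = [
    ((0.0, 0.0, 1.0), (1.0, 0.0, 0.0)),  # head-on view
    ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)),  # grazing view, perpendicular to the normal
]
blended = blend_texel(normal, observations)
```

In this toy case the grazing view receives zero weight, so the blended colour comes entirely from the head-on view.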