DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion

Yukun Huang, Jianan Wang, Ailing Zeng, Zheng-Jun Zha, Lei Zhang, Xihui Liu

TL;DR

DreamWaltz-G tackles the challenge of generating animatable 3D avatars from text by introducing Skeleton-guided Score Distillation (SkelSD) and a Hybrid 3D Gaussian Avatar (H3GA) representation. The method employs a two-stage training pipeline, Canonical Avatar Learning and Animatable Avatar Learning, and leverages SMPL-X skeletons to condition diffusion priors for 3D-consistent supervision. The hybrid representation combines unconstrained Gaussians with mesh-bound Gaussians linked to SMPL-X parts, enabling stable optimization, real-time rendering, and expressive animation including hands and facial expressions. Empirical results show superior visual quality and animation expressiveness over existing text-to-3D avatar methods, with diverse applications in video reenactment and multi-subject scenes.

Abstract

Leveraging pretrained 2D diffusion models and score distillation sampling (SDS), recent methods have shown promising results for text-to-3D avatar generation. However, generating high-quality 3D avatars capable of expressive animation remains challenging. In this work, we present DreamWaltz-G, a novel learning framework for animatable 3D avatar generation from text. The core of this framework lies in Skeleton-guided Score Distillation and Hybrid 3D Gaussian Avatar representation. Specifically, the proposed skeleton-guided score distillation integrates skeleton controls from 3D human templates into 2D diffusion models, enhancing the consistency of SDS supervision in terms of view and human pose. This facilitates the generation of high-quality avatars, mitigating issues such as multiple faces, extra limbs, and blurring. The proposed hybrid 3D Gaussian avatar representation builds on the efficient 3D Gaussians, combining neural implicit fields and parameterized 3D meshes to enable real-time rendering, stable SDS optimization, and expressive animation. Extensive experiments demonstrate that DreamWaltz-G is highly effective in generating and animating 3D avatars, outperforming existing methods in both visual quality and animation expressiveness. Our framework further supports diverse applications, including human video reenactment and multi-subject scene composition.
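
For orientation, the supervision described in the abstract can be written in the standard score distillation form with one extra conditioning input. This is a sketch under assumptions, not the paper's exact equation: $x$, $c$, and $L_\text{cSDS}$ are the rendered image, the skeleton image, and the loss named in the figure captions, while $\theta$ (avatar parameters), $\phi$ (frozen diffusion model), $y$ (text prompt), $t$ (timestep), $\epsilon$ (injected noise), $x_t$ (noised render), and $w(t)$ (timestep weighting) are generic SDS notation assumed here.

$$
\nabla_\theta L_\text{cSDS} = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, c,\, t) - \epsilon\bigr)\,\frac{\partial x}{\partial \theta} \right]
$$

Dropping $c$ gives the usual text-only SDS gradient. Because $c$ is rendered from the same 3D template and camera as $x$, the noise prediction is tied to the current view and pose, which is the consistency the abstract refers to.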

Paper Structure

This paper contains 18 sections, 15 equations, 18 figures, 2 tables.

Figures (18)

  • Figure 1: We present DreamWaltz-G, a text-driven animatable 3D avatar generation framework, which can create high-quality 3D avatars from imaginative text prompts and animate them given motion sequences without manual rigging and retraining. Our method enables various downstream applications, such as expressive animation production, shape editing, human video reenactment, and multi-subject scene composition.
  • Figure 2: The proposed skeleton-guided score distillation uses 2D skeleton images $c$ extracted from SMPL-X [smplx] to condition a controllable 2D diffusion model (we adopt ControlNet [controlnet]), which improves view and pose consistency between the rendered image $x$ and the SDS supervision $\Delta L_\text{cSDS}$. In addition, we introduce occlusion culling to eliminate keypoints that are invisible from the current viewpoint, preventing ambiguity for the diffusion model.
  • Figure 3: The proposed hybrid 3D Gaussian avatar representation integrates efficient 3D Gaussian Splatting [3dgs] with a neural implicit field (we adopt Instant-NGP [instant-ngp]) and parameterized 3D meshes of SMPL-X [smplx] body parts (e.g., hands and face). Specifically, the canonical 3D Gaussian avatar is jointly represented by unconstrained 3D Gaussians $\mathcal{G}_\text{u}$ and mesh-binding 3D Gaussians $\mathcal{G}_\text{m}$ bound to parameterized 3D meshes. The colors and opacities of both $\mathcal{G}_\text{u}$ and $\mathcal{G}_\text{m}$ are predicted by the neural implicit field. For animation, $\mathcal{G}_\text{u}$ and $\mathcal{G}_\text{m}$ are deformed separately and merged to form the observed 3D Gaussians, which are then splatted to obtain the rendered avatar image.
  • Figure 4: The proposed animatable 3D avatar generation framework DreamWaltz-G consists of two training stages: (I) Canonical Avatar Learning and (II) Animatable Avatar Learning. In Stage I, we adopt a static Instant-NGP [instant-ngp] as the canonical avatar representation. In each iteration, we extract a skeleton image from the canonical SMPL-X [smplx] to condition ControlNet [controlnet]. The skeleton-conditioned score distillation loss $L_\text{cSDS}$ is used as the training objective to learn the canonical avatar. In Stage II, the proposed animatable avatar representation H3GA is first initialized with the trained Instant-NGP from Stage I and then optimized by $L_\text{cSDS}$. Unlike Stage I, which uses a fixed canonical pose, Stage II randomly samples plausible human poses and expressions in each iteration to drive H3GA and SMPL-X, encouraging avatar learning across different motions.
  • Figure 5: Qualitative results of canonical avatars compared to existing text-driven 3D avatar generation methods: DreamWaltz [huang2023dreamwaltz], DreamHuman [kolotouros2024dreamhuman], TADA [liao2024tada], GAvatar [yuan2024gavatar], and HumanGaussian [liu2024humangaussian]. The text prompts used are listed on the left.
  • ...and 13 more figures
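
The occlusion culling mentioned in the Figure 2 caption can be illustrated with a simple depth test. The sketch below is an assumption about how such a visibility check might look, not the authors' implementation: the function name `cull_occluded_keypoints`, the pinhole camera parameters (`focal`, `cx`, `cy`), the tolerance `eps`, and the use of a depth map rendered from the body template are all hypothetical choices made for illustration. A keypoint is kept only if it lies in front of the camera, projects inside the image, and is not farther away than the surface depth at its pixel.

```python
def cull_occluded_keypoints(keypoints_cam, depth_map, focal, cx, cy, eps=1e-2):
    """Return one visibility flag per keypoint (hypothetical helper).

    keypoints_cam: list of (x, y, z) points in camera space, z pointing forward.
    depth_map: 2D list indexed as [row][col]; float("inf") marks background.
    focal, cx, cy: pinhole intrinsics in pixel units (assumed camera model).
    eps: depth tolerance so keypoints lying on the surface are not culled.
    """
    height, width = len(depth_map), len(depth_map[0])
    visible = []
    for x, y, z in keypoints_cam:
        # Points at or behind the camera plane cannot be seen.
        if z <= 0:
            visible.append(False)
            continue
        # Pinhole projection to the nearest pixel.
        u = int(round(focal * x / z + cx))
        v = int(round(focal * y / z + cy))
        # Points projecting outside the image are dropped as well.
        if not (0 <= u < width and 0 <= v < height):
            visible.append(False)
            continue
        # Depth test: keep the keypoint only if no surface lies in front of it.
        visible.append(z <= depth_map[v][u] + eps)
    return visible


if __name__ == "__main__":
    # 4x4 depth map with a single surface pixel at row 2, column 2.
    depth = [[float("inf")] * 4 for _ in range(4)]
    depth[2][2] = 1.0
    points = [(0.0, 0.0, 0.5), (0.0, 0.0, 2.0), (10.0, 0.0, 1.0)]
    print(cull_occluded_keypoints(points, depth, focal=1.0, cx=2.0, cy=2.0))
```

Only the keypoints flagged as visible would then be drawn into the skeleton image $c$, so the diffusion model is not shown, for example, facial keypoints when the avatar is viewed from behind.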