Text-based Animatable 3D Avatars with Morphable Model Alignment

Yiqian Wu, Malte Prinzler, Xiaogang Jin, Siyu Tang

TL;DR

AnimPortrait3D tackles the challenge of text-to-animatable 3D head avatars by separating initialization from dynamic optimization. It initializes a robust, SMPL-X-aligned avatar from a text description using Portrait3D-based geometry and appearance priors, then uses a ControlNet conditioned on dense normal and segmentation maps to guide diffusion-based refinement for dynamic expressions. Key contributions include a two-stage framework, a rigorous appearance/geometry initialization pipeline with hair/clothing asset generation, and a region-aware optimization using pre-trained eye/mouth guidance plus Interval Score Matching and SDEdit refinement. The approach yields higher synthesis quality, tighter alignment to the parametric model, and improved animation fidelity, advancing the state of the art in text-driven animatable 3D head avatars with practical implications for games, cinema, and virtual assistants.

Abstract

The generation of high-quality, animatable 3D head avatars from text has enormous potential in content creation applications such as games, movies, and embodied virtual assistants. Current text-to-3D generation methods typically combine parametric head models with 2D diffusion models using score distillation sampling to produce 3D-consistent results. However, they struggle to synthesize realistic details and suffer from misalignments between the appearance and the driving parametric model, resulting in unnatural animation results. We discovered that these limitations stem from ambiguities in the 2D diffusion predictions during 3D avatar distillation, specifically: i) the avatar's appearance and geometry are underconstrained by the text input, and ii) the semantic alignment between the predictions and the parametric head model is insufficient because the diffusion model alone cannot incorporate information from the parametric model. In this work, we propose a novel framework, AnimPortrait3D, for text-based realistic animatable 3DGS avatar generation with morphable model alignment, and introduce two key strategies to address these challenges. First, we tackle appearance and geometry ambiguities by utilizing prior information from a pretrained text-to-3D model to initialize a 3D avatar with robust appearance, geometry, and rigging relationships to the morphable model. Second, we refine the initial 3D avatar for dynamic expressions using a ControlNet that is conditioned on semantic and normal maps of the morphable model to ensure accurate alignment. As a result, our method outperforms existing approaches in terms of synthesis quality, alignment, and animation fidelity. Our experiments show that the proposed method advances the state of the art in text-based, animatable 3D head avatar generation.
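
The abstract refers to score distillation sampling (SDS) as the mechanism that prior text-to-3D methods use to transfer 2D diffusion guidance into 3D parameters. The following is a minimal toy sketch of the SDS update, not the authors' implementation. It assumes an identity "renderer" (the parameters are the image itself), a hypothetical `toy_eps_predictor` standing in for a real diffusion model, and the weighting `w = 1 - alpha_bar`, which is one common choice. All function names are illustrative.

```python
import math
import random


def toy_eps_predictor(x_t, alpha_bar, target):
    """Hypothetical stand-in for a diffusion model.

    Returns the exact noise prediction for a data distribution that is a
    single point `target` (so no neural network is needed for the demo).
    """
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(xt - s * m) / n for xt, m in zip(x_t, target)]


def sds_gradient(x0, target, alpha_bar, rng):
    """SDS gradient w(t) * (eps_hat - eps) for an identity renderer."""
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]           # sampled noise
    x_t = [s * x + n * e for x, e in zip(x0, eps)]    # forward diffusion
    eps_hat = toy_eps_predictor(x_t, alpha_bar, target)
    w = 1.0 - alpha_bar                               # assumed weighting
    return [w * (eh - e) for eh, e in zip(eps_hat, eps)]


def optimize(theta, target, steps=500, lr=0.1, seed=0):
    """Gradient descent on the parameters using only SDS gradients."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        alpha_bar = rng.uniform(0.05, 0.95)           # random noise level
        grad = sds_gradient(theta, target, alpha_bar, rng)
        theta = [p - lr * g for p, g in zip(theta, grad)]
    return theta


result = optimize([0.0, 0.0, 0.0], [1.0, -2.0, 0.5])
```

In this toy setting the parameters are pulled toward the single mode of the "diffusion prior". With a real text-conditioned model the prior has many modes for a given prompt, which is one way to picture the ambiguity the abstract describes.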

Paper Structure

This paper contains 48 sections, 8 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Overview of AnimPortrait3D. Given an input text, the 3D Avatar Initialization stage (see the Initialization section) generates a well-defined initial avatar that provides appearance and geometry prior information, and is rigged to SMPL-X for animation. During the Dynamic Optimization stage (see the Optimization section), we optimize the avatar for dynamic poses and expressions using a 2D diffusion model and a ControlNet. We first pre-train the eye and mouth regions, then optimize the full avatar and apply a refinement strategy to produce the final result. AnimPortrait3D is able to generate avatars with diverse appearances, ethnicities, and ages.
  • Figure 2: Visualization of (a) the static 3D avatar $P$ from Portrait3D, (b) the fitted SMPL-X model, (c) the noisy mesh $M_{\text{raw}}$ extracted from $P$, (d) the smoothed mesh $M_{\text{smooth}}$, (e) the normal map estimated from the renderings of $P$, (f) $M_{\text{refined}}$ optimized against the estimated normal maps, (g) the segmented hair mesh, (h) the segmented clothing mesh, and (i) the segmented face mesh.
  • Figure 3: (a) The ground-truth image from ControlNet's training dataset (originally derived from the FFHQ dataset (StyleGAN)). (b) Conditional normal maps. (c) Conditional segmentation maps. (d) Generated results using the corresponding inputs.
  • Figure 4: We optimize the eye region, mouth region, and full avatar sequentially, employing distinct loss functions at each stage. A diffusion model, together with a ControlNet conditioned on normal and segmentation maps, provides the guidance during optimization. For renderings of the full avatar only, we omit the ControlNet and rely solely on the diffusion model.
  • Figure 5: Generated results of our method. For each 3D avatar, we present rendered images with varying expressions and poses across different camera views, and the corresponding mesh for each avatar is shown at the lower right corner of each rendered image.
  • ...and 11 more figures
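
The sequential schedule described in the Figure 4 caption can be sketched as a small data structure. This is an illustrative sketch only: the `Stage` class, the stage names, and the `guidance_for` helper are hypothetical and do not come from the paper. The sketch encodes just two facts from the caption: the regions are optimized in the order eye, mouth, full avatar, and the ControlNet is omitted for renderings of the full avatar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str    # hypothetical label for the optimization stage
    region: str  # region optimized in this stage


# Order taken from the Figure 4 caption: eye, then mouth, then full avatar.
SCHEDULE = [
    Stage("pretrain_eyes", "eye"),
    Stage("pretrain_mouth", "mouth"),
    Stage("optimize_full", "full"),
]


def guidance_for(render_region: str) -> tuple:
    """Return which guidance models apply to a rendering of this region.

    Per the caption, full-avatar renderings rely on the diffusion model
    alone; other renderings also use the ControlNet.
    """
    if render_region == "full":
        return ("diffusion",)
    return ("diffusion", "controlnet")


for stage in SCHEDULE:
    print(stage.name, guidance_for(stage.region))
```

The distinct loss functions used at each stage are not specified in the caption, so they are deliberately left out of the sketch.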