X-Oscar: A Progressive Framework for High-quality Text-guided 3D Animatable Avatar Generation

Yiwei Ma, Zhekai Lin, Jiayi Ji, Yijun Fan, Xiaoshuai Sun, Rongrong Ji

TL;DR

X-Oscar targets two weaknesses of text-guided 3D avatar generation, oversaturation and low output quality, with a progressive Geometry->Texture->Animation framework built on an SMPL-X prior. It introduces Adaptive Variational Parameter (AVP), which represents the avatar as an adaptive distribution to mitigate oversaturation, and Avatar-aware Score Distillation Sampling (ASDS), which injects geometry- and appearance-aware noise into rendered images during diffusion-based optimization. In experiments against state-of-the-art text-to-3D and text-to-avatar methods, X-Oscar shows better geometry and texture fidelity while producing fully animatable avatars, with intended uses in gaming, AR/VR, and digital media. The framework offers an automated path from text prompts to high-quality, editable 3D avatars.

Abstract

Automatic text-guided 3D avatar generation has made significant progress recently. However, existing methods suffer from limitations such as oversaturation and low-quality output. To address these challenges, we propose X-Oscar, a progressive framework for generating high-quality animatable avatars from text prompts. It follows a sequential Geometry->Texture->Animation paradigm, simplifying optimization through step-by-step generation. To tackle oversaturation, we introduce Adaptive Variational Parameter (AVP), which represents avatars as an adaptive distribution during training. Additionally, we present Avatar-aware Score Distillation Sampling (ASDS), a novel technique that incorporates avatar-aware noise into rendered images for improved generation quality during optimization. Extensive evaluations confirm the superiority of X-Oscar over existing text-to-3D and text-to-avatar approaches. Our anonymous project page: https://xmu-xiaoma666.github.io/Projects/X-Oscar/.
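
The two ideas named in the abstract can be illustrated with a toy NumPy sketch. It is not the paper's formulation: the perturbation form in sample_avatar_params, the mixing rule in avatar_aware_noise, the toy_render stand-in, and every parameter value are assumptions made for illustration only. The one standard piece is the final line, which applies the usual diffusion forward step x_t = sqrt(alpha) * x_0 + sqrt(1 - alpha) * noise.

```python
import numpy as np


def sample_avatar_params(mu, perturb_scale, rng):
    """Draw one parameter sample from a Gaussian centred on mu.

    Assumed form: the spread is tied to the spread of mu itself,
    so the distribution adapts as mu changes during training.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + perturb_scale * mu.std() * eps


def avatar_aware_noise(shape, avatar_signal, mix, rng):
    """Blend pure Gaussian noise with a term that follows avatar_signal.

    Assumed form: mix in [0, 1] sets how much of the noise is
    avatar-dependent; mix=0 gives plain Gaussian noise.
    """
    random_part = rng.standard_normal(shape)
    return np.sqrt(1.0 - mix**2) * random_part + mix * avatar_signal


def toy_render(theta):
    """Hypothetical stand-in for a differentiable renderer (8x8 image)."""
    return np.tanh(theta).reshape(8, 8)


rng = np.random.default_rng(0)
mu = rng.standard_normal(64)  # toy "3D parameters"

# 1. Sample parameters from the avatar distribution.
theta = sample_avatar_params(mu, perturb_scale=0.1, rng=rng)

# 2. Render a 2D image from the sampled parameters.
image = toy_render(theta)

# 3. Build noise that depends on the rendered avatar, then noise the image.
signal = (image - image.mean()) / (image.std() + 1e-8)
noise = avatar_aware_noise(image.shape, signal, mix=0.3, rng=rng)
alpha = 0.9  # hypothetical cumulative signal weight at one timestep
noisy_image = np.sqrt(alpha) * image + np.sqrt(1.0 - alpha) * noise
```

In a real pipeline the noisy image would go to a pretrained diffusion model, and the difference between predicted and injected noise would drive the gradient on mu. That step is omitted here because it depends on model details the abstract does not give.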

Paper Structure (12 sections, 18 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Samples generated by X-Oscar along temporal and viewpoint dimensions. Left Prompt: "Steven Paul Jobs". Right Prompt: "David of Michelangelo".
  • Figure 2: Overview of the proposed X-Oscar, which consists of three generation stages: (a) geometry modeling, (b) appearance modeling, and (c) animation refinement.
  • Figure 3: The workflow of the proposed X-Oscar. First, we incorporate the adaptive perturbation into the 3D parameters, forming the avatar distribution. Next, we sample a set of parameters from the avatar distribution and render a 2D image. Finally, we apply avatar-aware noise to the rendered image for denoising to optimize 3D parameters.
  • Figure 4: Qualitative comparisons with SOTA text-to-avatar methods. The prompts (top → down) are "Gandalf from The Lord of the Rings", "Aladdin in Aladdin", and "Captain Jack Sparrow from Pirates of the Caribbean".
  • Figure 5: Qualitative comparisons with SOTA text-to-3D methods. The prompts (top → down) are "Anna in Frozen", "Hilary Clinton", and "Knight".
  • ...and 2 more figures