Multimodal Generation of Animatable 3D Human Models with AvatarForge

Xinhang Liu, Yu-Wing Tai, Chi-Keung Tang

TL;DR

AvatarForge addresses the challenge of generating realistic, animatable 3D human avatars from text or image prompts, a task where diffusion-based methods struggle because of the diversity of human bodies and the difficulty of animating the results faithfully. It combines a large language model (LLM) agent that drives procedural generation through off-the-shelf 3D human generators, an auto-verification agent for iterative refinement, and a motion-control agent that animates avatars via natural language. A dynamic manual guides the LLM through the high-dimensional parameter space, enabling fine-grained control over body shape, facial features, clothing, and poses. Experimental results show that AvatarForge outperforms state-of-the-art text- and image-to-avatar methods in quality and customization, and its interactive editing and real-time animation capabilities promise broad applicability in gaming, film, and virtual environments.

Abstract

We introduce AvatarForge, a framework for generating animatable 3D human avatars from text or image inputs using AI-driven procedural generation. While diffusion-based methods have made strides in general 3D object generation, they struggle to produce high-quality, customizable human avatars because of the complexity and diversity of human body shapes and poses, a difficulty exacerbated by the scarcity of high-quality data. Additionally, animating these avatars remains a significant challenge for existing methods. AvatarForge overcomes these limitations by combining LLM-based commonsense reasoning with off-the-shelf 3D human generators, enabling fine-grained control over body and facial details. Unlike diffusion models, which often rely on pre-trained datasets lacking precise control over individual human features, AvatarForge offers a more flexible approach: it brings humans into the iterative design and modeling loop, and its auto-verification system allows continuous refinement of the generated avatars, promoting high accuracy and customization. Our evaluations show that AvatarForge outperforms state-of-the-art methods in both text- and image-to-avatar generation, making it a versatile tool for artistic creation and animation.

Paper Structure

This paper contains 14 sections, 7 figures.

Figures (7)

  • Figure 1: AvatarForge for generating customizable and animatable 3D human avatars. The approach takes text or image inputs to create lifelike human figures with diverse body shapes, poses, and facial expressions. AvatarForge enables intuitive human modeling by refining the avatars according to user-specific requirements or feedback provided in natural language.
  • Figure 2: Iterative refinement process in the agent-critic framework for 3D human avatar generation. The LLM agent interacts with the critic model during the procedural generation of a human avatar. In this example, the agent's first attempts fail to meet the required criteria. The feedback loop and dynamic manual updates are key to refining the avatar generation process.
  • Figure 3: Chain-of-Thought reasoning process for generating a basketball player avatar. The figure illustrates the LLM's sequential steps: the observation of the input description, formulation of a plan for avatar creation, self-reminders to avoid potential bugs, and the implementation of the plan through Python code adjustments for customization. This structured process enables the LLM to generate an accurate and detailed avatar based on the user's input.
  • Figure 4: Diverse 3D human avatars generated using AvatarForge, showcasing a variety of body types, outfits, and poses based on text descriptions. This demonstrates the framework's capability to create highly customizable and realistic human models.
  • Figure 5: Input images (left) and the outputs generated by AvatarForge (middle), showing the reconstructed 3D avatars. The images on the right are the recovered images. (On the other hand, manual effort is needed to achieve a similar visual effect.)
  • ...and 2 more figures
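
The agent-critic loop described in the Figure 2 caption (an agent proposes generator parameters, a critic checks them, and feedback is folded into a dynamic manual before the next attempt) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the names `DynamicManual`, `refine`, `toy_agent`, and `toy_critic`, the `height_cm` parameter, and the 190 cm threshold are all hypothetical, and the toy agent and critic are simple rule-based stand-ins for what the paper describes as LLM-based components.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Params = Dict[str, int]  # hypothetical avatar-generator parameters


@dataclass
class DynamicManual:
    """Hypothetical store of notes the agent accumulates from critic feedback."""
    notes: List[str] = field(default_factory=list)

    def update(self, feedback: str) -> None:
        # Record each distinct piece of feedback once.
        if feedback not in self.notes:
            self.notes.append(feedback)


@dataclass
class RefineResult:
    params: Params
    rounds: int
    accepted: bool


def refine(
    prompt: str,
    agent: Callable[[str, DynamicManual, Params], Params],
    critic: Callable[[str, Params], Tuple[bool, str]],
    manual: DynamicManual,
    max_rounds: int = 5,
) -> RefineResult:
    """Alternate agent proposals and critic checks until acceptance or budget."""
    params: Params = {}
    for round_idx in range(1, max_rounds + 1):
        params = agent(prompt, manual, params)
        ok, feedback = critic(prompt, params)
        if ok:
            return RefineResult(params, round_idx, True)
        manual.update(feedback)  # feedback shapes the next proposal
    return RefineResult(params, max_rounds, False)


def toy_agent(prompt: str, manual: DynamicManual, prev: Params) -> Params:
    """Stand-in for the LLM agent: start from a default, then follow the manual."""
    height = prev.get("height_cm", 160)
    if "increase height_cm" in manual.notes:
        height += 10
    else:
        height = 170
    return {"height_cm": height}


def toy_critic(prompt: str, params: Params) -> Tuple[bool, str]:
    """Stand-in for the critic: a basketball player should be tall (assumed rule)."""
    if "basketball" in prompt and params["height_cm"] < 190:
        return False, "increase height_cm"
    return True, ""


if __name__ == "__main__":
    manual = DynamicManual()
    result = refine("a basketball player", toy_agent, toy_critic, manual)
    print(result, manual.notes)
```

In this toy run the first two proposals are rejected and the third is accepted, mirroring the caption's remark that the agent's first attempts fail before the feedback loop and manual updates bring the avatar within the required criteria. A real system would replace the rule-based stand-ins with model calls and a richer parameter space.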