
Motion Avatar: Generate Human and Animal Avatars with Arbitrary Motion

Zeyu Zhang, Yiran Wang, Biao Wu, Shuo Chen, Zhiyuan Zhang, Shiya Huang, Wenbo Zhang, Meng Fang, Ling Chen, Yang Zhao

TL;DR

Motion Avatar tackles the challenge of jointly generating high-quality human and animal avatars along with their motions from text. It deploys an LLM-planner to coordinate motion and mesh generation, a two-stage MoMask-based motion model, and an image-to-3D avatar pipeline that culminates in riggable meshes. The introduction of Zoo-300K and the ZooGen pipeline addresses critical data gaps for animal motion, while Avatar Q&A and HumanML3D support robust evaluation. Together these contributions enable end-to-end, text-driven dynamic avatars with broad potential for film, games, AR/VR, and robotics.

Abstract

In recent years, there has been significant interest in creating 3D avatars and motions, driven by their diverse applications in areas such as film-making, video games, AR/VR, and human-robot interaction. However, current efforts primarily concentrate on either generating the 3D avatar mesh alone or producing motion sequences alone, and integrating the two remains a persistent challenge. Additionally, while avatar and motion generation predominantly target humans, extending these techniques to animals remains difficult due to inadequate training data and methods. To bridge these gaps, our paper presents three key contributions. First, we propose a novel agent-based approach named Motion Avatar, which automatically generates high-quality, customizable human and animal avatars with motions from text queries, advancing dynamic 3D character generation. Second, we introduce an LLM planner that coordinates both motion and avatar generation, recasting discriminative planning in a customizable Q&A format. Third, we present an animal motion dataset named Zoo-300K, comprising approximately 300,000 text-motion pairs across 65 animal categories, together with its building pipeline ZooGen, as a resource for the community. See the project website: https://steve-zeyu-zhang.github.io/MotionAvatar/

Paper Structure

This paper contains 21 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The diagram illustrates the process of our proposed ZooGen. Initially, SinMDM [raab2023single] is employed to edit and enhance motion within Truebones Zoo free. Subsequently, Video-LLaMA [zhang2023video] is utilized to describe the motion in a paragraph, followed by refinement using LLaMA-70B [touvron2023llama]. Finally, human review is conducted on the motion captions, which are then gathered as textual descriptions in the Zoo-300K dataset.
  • Figure 2: Motion Avatar uses an LLM-agent-based approach to manage user queries and produce tailored prompts. These prompts drive both the generation of motion sequences and the creation of 3D meshes. Motion generation follows an autoregressive process, while mesh generation operates within an image-to-3D framework. The generated mesh then undergoes automatic rigging, allowing the motion to be retargeted to the rigged mesh.
  • Figure 3: The figure illustrates various examples of animal motion generated by Motion Avatar, demonstrating its ability to produce high-quality motion and mesh for both human and animal characters.
  • Figure 4: The figure showcases various examples of generated 3D meshes, encompassing both human and animal avatars. The meshes exhibit high-quality geometry and offer customizable textures, thus serving as a robust foundation for avatar animation. Furthermore, this advancement holds promise for enhancing the technique's applicability in real-world scenarios.
  • Figure 5: This figure displays the User Interface (UI) used in our User Study, showcasing four videos (Video A to D) each with distinct motion animations from various models. Participants evaluate these animations on aspects such as motion accuracy, mesh quality, integration of motion and mesh, and overall user experience. They rate each aspect from 1 (low) to 5 (high) to assess how the animations mirror real-world movements, the visual appeal of the animations, their integration quality, and their engagement level. This evaluation aims to determine the realism and engagement effectiveness of each animation model.
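
The data flow described in the Figure 2 caption (planner, then motion and mesh generation, then rigging and retargeting) can be sketched roughly as below. This is only an illustrative outline, not the authors' implementation: the `Plan` record, the `key: value` answer format, and every callable passed to `generate_avatar` are hypothetical stand-ins, and the image-to-3D stage is collapsed into a single `mesh_model` callable.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Plan:
    """Planner output (hypothetical schema, not taken from the paper)."""
    subject_type: str   # e.g. "human" or "animal"
    motion_prompt: str  # text prompt for the motion generator
    mesh_prompt: str    # text prompt for the avatar (mesh) generator


def parse_plan(answer: str) -> Plan:
    """Parse a planner answer made of 'key: value' lines (assumed format)."""
    fields = {}
    for line in answer.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip().lower()] = value.strip()
    missing = {"type", "motion", "mesh"} - set(fields)
    if missing:
        raise ValueError(f"planner answer is missing fields: {sorted(missing)}")
    return Plan(fields["type"], fields["motion"], fields["mesh"])


def generate_avatar(
    query: str,
    planner: Callable[[str], str],            # stand-in for the LLM planner
    motion_model: Callable[[str, str], Any],  # stand-in for motion generation
    mesh_model: Callable[[str], Any],         # stand-in for image-to-3D mesh generation
    rigger: Callable[[Any], Any],             # stand-in for automatic rigging
    retargeter: Callable[[Any, Any], Any],    # stand-in for motion retargeting
) -> Any:
    """Run the stages in the order given by the Figure 2 caption."""
    plan = parse_plan(planner(query))
    motion = motion_model(plan.motion_prompt, plan.subject_type)
    mesh = mesh_model(plan.mesh_prompt)
    rigged_mesh = rigger(mesh)
    return retargeter(motion, rigged_mesh)
```

Passing the stages in as callables keeps the sketch independent of any particular model; each stub can be swapped for a real component without changing the orchestration order.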