CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization

Hao-Yang Peng, Jia-Peng Zhang, Meng-Hao Guo, Yan-Pei Cao, Shi-Min Hu

TL;DR

This work tackles the challenge of generating riggable 3D characters from a single image amid pose variation and self-occlusion. It introduces a two-stage pipeline: an image-conditioned multi-view diffusion model for pose canonicalization to an A-pose and a transformer-based sparse-view reconstruction for detailed geometry and textures, followed by texture refinement. A new Anime3D dataset of nearly 14k stylized characters with multi-view and pose diversity enables robust training and evaluation. The approach delivers fast, high-quality 3D character meshes suitable for downstream animation, with strong improvements in multi-view consistency, texture fidelity, and rigging readiness over state-of-the-art baselines.

Abstract

In the field of digital content creation, generating high-quality 3D characters from single images is challenging, especially given the complexities of various body poses and the issues of self-occlusion and pose ambiguity. In this paper, we present CharacterGen, a framework developed to efficiently generate 3D characters. CharacterGen introduces a streamlined generation pipeline along with an image-conditioned multi-view diffusion model. This model effectively calibrates input poses to a canonical form while retaining key attributes of the input image, thereby addressing the challenges posed by diverse poses. A transformer-based, generalizable sparse-view reconstruction model is the other core component of our approach, facilitating the creation of detailed 3D models from multi-view images. We also adopt a texture-back-projection strategy to produce high-quality texture maps. Additionally, we have curated a dataset of anime characters, rendered in multiple poses and views, to train and evaluate our model. Our approach has been thoroughly evaluated through quantitative and qualitative experiments, showing its proficiency in generating 3D characters with high-quality shapes and textures, ready for downstream applications such as rigging and animation.

Paper Structure (29 sections, 3 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: An example character from our Anime3D dataset, rendered from four different camera views, demonstrating how we organize the image pairs during training to extend the UNet's ability to determine a canonical pose.
  • Figure 2: Pipeline for generating four consistent views, showing how our IDUNet extracts local pixel-level features to strengthen the multi-view UNet. Here "Q", "K", and "V" denote the query, key, and value matrices in the attention mechanism.
  • Figure 3: Pipeline for generating a final refined character mesh from generated multi-view images. In the first stage, we utilize a deep transformer-based network to generate a character with a coarse texture and then use a texture back-projection strategy to enhance the appearance of the generated mesh.
  • Figure 4: We compare our four generated A-pose character images with those of other methods. The azimuths for all examples are set to $\{0^\circ, 90^\circ, 180^\circ, 270^\circ\}$. $\copyright$ kinoko7
  • Figure 5: We compare the appearance and geometry of our generated 3D characters with other methods. $\copyright$ kinoko7
  • ...and 5 more figures
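
The Figure 4 caption fixes the four view azimuths at 0°, 90°, 180°, and 270°. As a rough illustration of what such a camera layout looks like, the sketch below places four cameras on a circle around a character at the origin. Only the azimuth values come from the caption. The function name `camera_position`, the orbit radius, the zero elevation, and the y-up axis convention are assumptions made for this example and may differ from the paper's actual rendering setup.

```python
import math

# Azimuths taken from the Figure 4 caption.
AZIMUTHS_DEG = [0, 90, 180, 270]


def camera_position(azimuth_deg, elevation_deg=0.0, radius=1.5):
    """Return the (x, y, z) position of a camera orbiting the origin.

    Assumed convention (not from the paper): y is up, and azimuth 0
    places the camera on the +z axis looking back at the origin.
    The radius and elevation defaults are hypothetical.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = radius * math.cos(el) * math.sin(az)
    y = radius * math.sin(el)
    z = radius * math.cos(el) * math.cos(az)
    return (x, y, z)


# One camera per azimuth, all at the same height and distance.
positions = [camera_position(a) for a in AZIMUTHS_DEG]

for azimuth, (x, y, z) in zip(AZIMUTHS_DEG, positions):
    print(f"azimuth {azimuth:3d} deg -> x={x:+.2f}, y={y:+.2f}, z={z:+.2f}")
```

Under this assumed convention the four cameras sit in front of, to one side of, behind, and to the other side of the character, which is why four such views are enough to cover the full silhouette of an A-pose figure.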