CharNeRF: 3D Character Generation from Concept Art

Eddy Chu, Yiyang Chen, Chedy Raissi, Anand Bhojan

TL;DR

CharNeRF reconstructs 3D characters from concept art by extending NeRF with a concept-art-aware pipeline. It encodes the three turnaround sketches (front, side, and back) using a double-hourglass encoder, fuses per-view features through a view-direction-attended multi-head self-attention module, and uses a mixed sampling strategy that combines ray sampling with surface sampling to guide shape learning. The approach improves novel-view synthesis and enables mesh extraction; quantitative metrics and ablations validate the contribution of each component. One limitation is reduced fine detail in highly variable regions such as hands, attributed to data scarcity. Future work aims to integrate diffusion priors and instance-specific fine-tuning to enhance realism and controllability.

Abstract

3D modeling holds significant importance in the realms of AR/VR and gaming, allowing for both artistic creativity and practical applications. However, the process is often time-consuming and demands a high level of skill. In this paper, we present a novel approach to create volumetric representations of 3D characters from consistent turnaround concept art, which serves as the standard input in the 3D modeling industry. While Neural Radiance Field (NeRF) has been a game-changer in image-based 3D reconstruction, to the best of our knowledge, there is no known research that optimizes the pipeline for concept art. To harness the potential of concept art, with its defined body poses and specific view angles, we propose encoding it as priors for our model. We train the network to make use of these priors for various 3D points through a learnable view-direction-attended multi-head self-attention layer. Additionally, we demonstrate that a combination of ray sampling and surface sampling enhances the inference capabilities of our network. Our model is able to generate high-quality 360-degree views of characters. Subsequently, we provide a simple guideline to better leverage our model to extract the 3D mesh. It is important to note that our model's inferencing capabilities are influenced by the training data's characteristics, primarily focusing on characters with a single head, two arms, and two legs. Nevertheless, our methodology remains versatile and adaptable to concept art from diverse subject matters, without imposing any specific assumptions on the data.
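
To make the idea of view-direction-attended fusion concrete, here is a minimal single-head sketch in plain Python. It is not the authors' layer: the abstract describes a learnable multi-head self-attention module, whereas this sketch assumes fixed cosine-similarity scores between the query view direction and each source view direction, with no learned projections. The names `fuse_views`, `source_dirs`, `view_features`, and `temperature` are hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def normalize(v):
    # Scale a 3D vector to unit length.
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def fuse_views(query_dir, source_dirs, view_features, temperature=1.0):
    """Blend per-view feature vectors with weights driven by view direction.

    Assumption: scores are plain cosine similarities. The layer described in
    the paper is learnable and multi-head, which this sketch does not model.
    """
    q = normalize(query_dir)
    scores = [
        sum(a * b for a, b in zip(q, normalize(d))) / temperature
        for d in source_dirs
    ]
    weights = softmax(scores)
    dim = len(view_features[0])
    fused = [
        sum(w * f[i] for w, f in zip(weights, view_features))
        for i in range(dim)
    ]
    return fused, weights

# Hypothetical front, side, and back camera directions with toy 2-D features.
source_dirs = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 0.0, -1.0)]
view_features = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Query from the front: the front view should receive the largest weight.
fused, weights = fuse_views((0.0, 0.0, 1.0), source_dirs, view_features)
```

In this toy setup a query direction aligned with the front view weights the front features most and the back features least, which is the qualitative behavior one would expect from attending over view directions.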

Paper Structure (15 sections, 12 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: CharNeRF constructs a neural radiance field for characters from concept art (front, side, and back images in the left columns) and is able to infer novel views of characters with complicated shapes (right columns). The first four rows show results for real concept art created by 2D artists, while the last three rows use synthetic inputs created from existing 3D meshes.
  • Figure 2: Method overview. CharNeRF first encodes concept art sketches into pixel-aligned feature vectors. During rendering, feature vectors are extracted and combined through a view-direction-attended self-attention layer. The combined features are then fed to the neural radiance field to guide the inference of color and volume density.
  • Figure 3: The marching cubes algorithm is applied at three distinct resolutions to the same object, yielding three mesh representations that can be regarded as three levels of detail. The polygon and vertex counts for each representation are provided above.
  • Figure 4: The estimated densities are averaged across varying numbers of camera angles. Overall, mesh quality correlates positively with the number of camera angles used. Meshes are constructed at a resolution of 64x64x64.
  • Figure 5: The overall network structure. $z_i$ represents the feature vectors extracted from the feature maps, and $x$ is the positionally encoded 3D coordinates and view direction. Both $x$ and $z_i$ are projected through independent linear layers before they are combined. The three combined representations are passed through three single-view MLP blocks in parallel and then merged by a view-direction-attended multi-head self-attention combinator.
  • ...and 3 more figures
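
The density averaging described in the Figure 4 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's procedure: `toy_density` is a hypothetical stand-in for a view-conditioned density query (a unit sphere with a small view-dependent perturbation), the camera directions are assumed to lie evenly on a horizontal circle, and the `threshold` and `extent` values are arbitrary. The resulting boolean grid is the kind of volume a marching cubes implementation could take as input.

```python
import math

def toy_density(point, view_dir):
    """Hypothetical stand-in for a view-conditioned density query.

    High density inside a unit sphere, plus a small view-dependent term
    that mimics disagreement between camera angles.
    """
    x, y, z = point
    inside = 1.0 if x * x + y * y + z * z <= 1.0 else 0.0
    wobble = 0.2 * (x * view_dir[0] + z * view_dir[2])
    return max(0.0, inside * 10.0 + wobble)

def camera_dirs(n_views):
    # Assumed layout: n_views directions evenly spaced on a horizontal circle.
    return [
        (math.sin(2 * math.pi * k / n_views), 0.0, math.cos(2 * math.pi * k / n_views))
        for k in range(n_views)
    ]

def averaged_occupancy(density_fn, resolution, n_views, threshold, extent=1.5):
    """Average density over camera angles on a cubic grid, then threshold it."""
    dirs = camera_dirs(n_views)
    step = 2 * extent / (resolution - 1)
    coords = [-extent + i * step for i in range(resolution)]
    grid = []
    for x in coords:
        plane = []
        for y in coords:
            row = []
            for z in coords:
                mean = sum(density_fn((x, y, z), d) for d in dirs) / n_views
                row.append(mean > threshold)
            plane.append(row)
        grid.append(plane)
    return grid

# Small grid for illustration; the caption reports meshes built at 64x64x64.
grid = averaged_occupancy(toy_density, resolution=9, n_views=8, threshold=5.0)
```

Averaging over more directions smooths out the view-dependent term before thresholding, which is one plausible reading of why the caption reports better meshes as the number of camera angles grows.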