Animatable Virtual Humans: Learning pose-dependent human representations in UV space for interactive performance synthesis

Wieland Morgenstern, Milena T. Bagdasarian, Anna Hilsmann, Peter Eisert

Abstract

We propose a novel representation of virtual humans for highly realistic real-time animation and rendering in 3D applications. We learn pose-dependent appearance and geometry from highly accurate dynamic mesh sequences obtained from state-of-the-art multiview-video reconstruction. Learning pose-dependent appearance and geometry from mesh sequences poses significant challenges, as it requires the network to learn the intricate shape and articulated motion of a human body. However, statistical body models like SMPL provide valuable a-priori knowledge, which we leverage in two ways: to constrain the dimension of the search space, enabling more efficient and targeted learning, and to define pose-dependency. Instead of directly learning absolute pose-dependent geometry, we learn the difference between the observed geometry and the fitted SMPL model. This allows us to encode both pose-dependent appearance and geometry in the consistent UV space of the SMPL model. This approach not only ensures a high level of realism but also facilitates streamlined processing and rendering of virtual humans in real-time scenarios.
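
The idea of storing geometry as a difference to the fitted body model in UV space can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that observed and template vertices are already in one-to-one correspondence, that each vertex has a single UV coordinate in [0, 1], and that offsets are stored as full 3D vectors in the nearest texel. The function names `encode_displacement_uv` and `decode_displacement_uv`, the `size` parameter, and the toy data are hypothetical.

```python
import numpy as np


def uv_to_texels(uv, size):
    """Map UV coordinates in [0, 1] to integer texel indices (nearest texel)."""
    return np.clip(np.rint(uv * (size - 1)).astype(int), 0, size - 1)


def encode_displacement_uv(observed, template, uv, size=64):
    """Scatter per-vertex offsets (observed - template) into a UV-space map.

    Assumption: one UV coordinate per vertex and no two vertices sharing a
    texel; a real pipeline would rasterize triangles instead.
    """
    offsets = observed - template              # (N, 3) residual geometry
    texels = uv_to_texels(uv, size)
    disp_map = np.zeros((size, size, 3), dtype=observed.dtype)
    disp_map[texels[:, 1], texels[:, 0]] = offsets   # row = v, column = u
    return disp_map


def decode_displacement_uv(disp_map, template, uv):
    """Look up each vertex's offset in the map and add it to the template."""
    texels = uv_to_texels(uv, disp_map.shape[0])
    return template + disp_map[texels[:, 1], texels[:, 0]]


# Toy data: a four-vertex "template" and a slightly deformed "observation".
template = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]], dtype=np.float64)
observed = template + np.array(
    [[0, 0, 0.1], [0, 0, 0.2], [0.05, 0, 0], [0, -0.05, 0.1]]
)
uv = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=np.float64)

disp_map = encode_displacement_uv(observed, template, uv, size=8)
reconstructed = decode_displacement_uv(disp_map, template, uv)
```

Because the map lives in the template's UV layout, it has the same shape for every frame and pose. A network can therefore predict such maps with image-like outputs, while the body model supplies the articulated motion.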

Paper Structure (15 sections, 2 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: The four images on the left show a reconstructed model from a performance capture in a fixed pose ($S_t$), the shadow model ($M_t^S$), the registered model ($M_t^R$) with the projected texture, and finally the registered model with the displacement map added. This process leaves artifacts in the texture $c$, and the computed displacement $d$ is noisy when re-applied. The two images on the right depict the animatable representation with texture $c'$ and displacement $d'$, which we can synthesize in novel poses. Here it is rendered in the same pose as the captured frame to allow for direct visual comparison. The neural network has learned to synthesize texture $c'$ and displacement $d'$ without the artifacts seen in the projection, while keeping many of the details of the capture.
  • Figure 2: The content from a textured mesh sequence is projected onto a shadow model. The resulting textures, correspondence confidence map, displacement map and surface visibility score are used to train a network to synthesize this data, given a specific pose. In the network, the pose is encoded into a latent space, which selects features from a feature cuboid. These features are then successively upsampled into full-size texture and displacement maps.
  • Figure 3: Selecting performance capture frames by pose variance
  • Figure 4: Performance capture frames versus animatable virtual human (AVH) in a grabbing position (left two images) and a stop position (right two images).