Democratizing the Creation of Animatable Facial Avatars

Yilin Zhu, Dalton Omens, Haodi He, Ron Fedkiw

TL;DR

This work aims to democratize the creation of animatable facial avatars by removing the dependence on light stages and other high-end capture hardware. It introduces a pipeline that warps real-world RGB images onto a template avatar to bake surrogate lighting into texture, enabling accurate geometry refinement and a de-lit texture suitable for common graphics pipelines; it further couples this with multi-view optimization, proper texture alignment, and inverse rendering techniques. A key contribution is the combination of a Simon Says–driven capture of expressions with a Mesh2MetaHuman–based rig, including a volumetric morph to adapt the standard rig to individual geometry and motion signatures, plus a video-based extension for subjects without available imagery. Collectively, the method yields person-specific animatable rigs that preserve likeness and motion while relying only on consumer hardware, broadening access to realistic facial avatars for games, AR/VR, and communications.

Abstract

In high-end visual effects pipelines, a customized (and expensive) light stage system is (typically) used to scan an actor in order to acquire both geometry and texture for various expressions. Aiming towards democratization, we propose a novel pipeline for obtaining geometry and texture as well as enough expression information to build a customized person-specific animation rig without using a light stage or any other high-end hardware (or manual cleanup). A key novel idea consists of warping real-world images to align with the geometry of a template avatar and subsequently projecting the warped image into the template avatar's texture; importantly, this allows us to leverage baked-in real-world lighting/texture information in order to create surrogate facial features (and bridge the domain gap) for the sake of geometry reconstruction. Not only can our method be used to obtain a neutral expression geometry and de-lit texture, but it can also be used to improve avatars after they have been imported into an animation system (noting that such imports tend to be lossy, while also hallucinating various features). Since a default animation rig will contain template expressions that do not correctly correspond to those of a particular individual, we use a Simon Says approach to capture various expressions and build a person-specific animation rig (that moves like they do). Our aforementioned warping/projection method is effective enough to reconstruct the geometry corresponding to each expression.
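To make the image-space warping step more concrete, the following is a minimal, dependency-free sketch of a backward warp with bilinear sampling. It is not the authors' implementation: the names `bilinear_sample` and `warp_image`, the grayscale list-of-rows image layout, and the dense displacement field `flow` are all assumptions made for illustration, and the abstract does not say how the warp itself is estimated, so `flow` is simply taken as given.

```python
import math


def bilinear_sample(image, x, y):
    """Sample a grayscale image (list of rows) at real-valued (x, y).

    Coordinates are clamped to the image border (an illustrative choice).
    """
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    # Interpolate along x on the two neighboring rows, then along y.
    top = (1 - wx) * image[y0][x0] + wx * image[y0][x1]
    bot = (1 - wx) * image[y1][x0] + wx * image[y1][x1]
    return (1 - wy) * top + wy * bot


def warp_image(image, flow):
    """Backward warp: output pixel (x, y) takes the input value at
    (x + dx, y + dy), where (dx, dy) = flow[y][x] (hypothetical layout)."""
    h, w = len(image), len(image[0])
    return [
        [bilinear_sample(image, x + flow[y][x][0], y + flow[y][x][1])
         for x in range(w)]
        for y in range(h)
    ]


# Usage: a half-pixel horizontal displacement blends adjacent columns.
image = [[0.0, 1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0, 7.0]]
half_pixel = [[(0.5, 0.0)] * 4 for _ in range(2)]
warped = warp_image(image, half_pixel)
```

Backward warping (looking up a source location for every output pixel) is used here because it leaves no holes in the output; in a pipeline like the one described, the warped image would then be projected into the template avatar's texture.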

Paper Structure (20 sections, 6 equations, 19 figures)


Figures (19)

  • Figure 1: From left to right: geometry, same geometry textured from a front-facing image, same geometry/texture as seen from a three-quarters view. This emphasizes how misleading a textured geometry can be when not considered from significantly novel views.
  • Figure 2: From top left to top right: avatar geometry (derived from the geometry in Figure \ref{fig:good_tex_bad_geo}), same geometry textured from a front-facing image, same geometry/texture with a smile expression. This emphasizes how misleading a textured geometry can be when not considering various expressions. Bottom: zoomed-in view of the top middle and top right figures. Note in particular how part of the bottom lip texture (and the crease between the lips) appears on the top lip.
  • Figure 3: From left to right: (a) captured image with a (blue) tracing of the silhouette, nostril, and lip corner, (b) initial triangle mesh created from the front view, (c) pixel-aligned projection of the front view triangle mesh onto the rough scan (with the aid of Laplacian smoothing), (d) accounting for silhouette boundaries of adjacent views, (e) MetaHuman reconstruction (note how fitting to a template hallucinates and modifies geometry).
  • Figure 4: A real-world image (shown in the first figure) is warped (in image space) to better match a synthetic rendering of a current guess for the geometry (using an appropriate texture). A zoomed-in view of the real-world image is shown before (second figure) and after (third figure) the warp. The fourth figure shows the current geometry, and the fifth figure shows the result obtained by projecting the warped real-world image onto that geometry. This new texture contains baked-in lighting that provides surrogate features useful when optimizing the synthetic geometry to match the original unwarped image.
  • Figure 5: From left to right: (a) captured image with a (blue) tracing of the silhouette, nostril, lip corner, eye corner, and mouth corner, (b) MetaHuman reconstruction from section \ref{sec:initialgeo}, Figure \ref{fig:initial_recon}, (c) the results obtained by using our geometry refinement process (in section \ref{sec:neutralgeo}). In particular, note the improvements in the eye and mouth regions.
  • ...and 14 more figures
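
The Figure 3 caption mentions Laplacian smoothing as an aid when projecting the front-view mesh onto the rough scan. As a hedged illustration only, the sketch below shows the simplest uniform ("umbrella") variant, in which each vertex moves toward the average of its edge-connected neighbors. The function name `laplacian_smooth`, the `step` and `iterations` parameters, and the uniform weighting are assumptions; the captions do not state which weighting or solver the paper uses.

```python
def laplacian_smooth(vertices, triangles, step=0.5, iterations=10):
    """Uniform Laplacian smoothing of a triangle mesh.

    vertices:  list of (x, y, z) tuples.
    triangles: list of (i, j, k) vertex-index triples.
    step:      fraction of the way each vertex moves toward the centroid
               of its neighbors per iteration (hypothetical parameter).
    """
    # Build vertex adjacency from the triangle edges.
    neighbors = [set() for _ in vertices]
    for a, b, c in triangles:
        neighbors[a].update((b, c))
        neighbors[b].update((a, c))
        neighbors[c].update((a, b))

    verts = [tuple(v) for v in vertices]
    for _ in range(iterations):
        updated = []
        for i, v in enumerate(verts):
            nbrs = neighbors[i]
            if not nbrs:
                updated.append(v)  # isolated vertex: leave unchanged
                continue
            centroid = tuple(
                sum(verts[j][k] for j in nbrs) / len(nbrs) for k in range(3)
            )
            updated.append(
                tuple(v[k] + step * (centroid[k] - v[k]) for k in range(3))
            )
        verts = updated  # all vertices are updated simultaneously
    return verts


# Usage: a raised center vertex surrounded by a planar ring is pulled
# toward the plane of its neighbors.
fan_vertices = [(0, 0, 1), (1, 0, 0), (0, 1, 0), (-1, 0, 0), (0, -1, 0)]
fan_triangles = [(0, 1, 2), (0, 2, 3), (0, 3, 4), (0, 4, 1)]
smoothed = laplacian_smooth(fan_vertices, fan_triangles, step=1.0, iterations=1)
```

Uniform weights are chosen here only for brevity. Note that this unconstrained form also moves boundary vertices and shrinks the mesh over many iterations, so a practical pipeline would typically pin or otherwise constrain some vertices.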