Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes

Ziqian Bai; Feitong Tan; Sean Fanello; Rohit Pandey; Mingsong Dou; Shichen Liu; Ping Tan; Yinda Zhang

TL;DR

This paper introduces mesh-anchored hash table blendshapes for efficient, controllable 3D implicit head avatars. By attaching multiple small hash tables to each 3DMM vertex and predicting per-vertex blendweights via a UV-space CNN, the method forms expression-dependent embeddings that feed a lightweight NeRF decoder, drastically reducing compute. A hierarchical $k$-NN search accelerates embedding retrieval, enabling real-time rendering (>30 FPS) at $512\times512$ while maintaining high fidelity, even for challenging expressions. Trained from monocular RGB videos, the approach delivers a favorable accuracy-speed trade-off against state-of-the-art high-quality and efficient avatars, with ablations validating the importance of local hash tables and the hierarchical search. Limitations include floaters and some instability in mouth interiors, suggesting avenues for training-based regularization and refinement.
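
The summary above does not spell out how the hierarchical $k$-NN search is organized. As a hedged illustration only, one plausible two-level scheme is sketched below: vertices are grouped around a few seed vertices (coarse level), and an exact search is run only inside the closest groups (fine level). The function names, the random seed selection, and the `n_probe` parameter are assumptions made for this sketch, not the paper's implementation.

```python
import numpy as np

def build_clusters(vertices, n_clusters, rng):
    """Group vertices around n_clusters randomly chosen seed vertices (assumed scheme)."""
    seeds = vertices[rng.choice(len(vertices), n_clusters, replace=False)]
    dists = np.linalg.norm(vertices[:, None, :] - seeds[None, :, :], axis=-1)
    return seeds, dists.argmin(axis=1)  # seed positions, cluster id per vertex

def hierarchical_knn(query, vertices, seeds, assign, k, n_probe):
    """Coarse step: keep the n_probe nearest seeds.
    Fine step: exact k-NN among the vertices assigned to those seeds."""
    coarse = np.argsort(np.linalg.norm(seeds - query, axis=1))[:n_probe]
    candidates = np.flatnonzero(np.isin(assign, coarse))
    dists = np.linalg.norm(vertices[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]  # may hold fewer than k indices

def brute_force_knn(query, vertices, k):
    """Reference: exact k-NN over all vertices."""
    return np.argsort(np.linalg.norm(vertices - query, axis=1))[:k]

rng = np.random.default_rng(0)
vertices = rng.uniform(-1.0, 1.0, size=(200, 3))  # toy stand-in for 3DMM vertices
seeds, assign = build_clusters(vertices, n_clusters=10, rng=rng)
query = rng.uniform(-1.0, 1.0, size=3)

approx = hierarchical_knn(query, vertices, seeds, assign, k=4, n_probe=2)
full = hierarchical_knn(query, vertices, seeds, assign, k=4, n_probe=10)
exact = brute_force_knn(query, vertices, k=4)
```

Probing every cluster reduces to the brute-force search, so `n_probe` trades accuracy for speed in this sketch: smaller values examine fewer candidate vertices per query point but can miss a true neighbor near a cluster boundary.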

Abstract

3D head avatars built with neural implicit volumetric representations have achieved unprecedented levels of photorealism. However, the computational cost of these methods remains a significant barrier to their widespread adoption, particularly in real-time applications such as virtual reality and teleconferencing. While attempts have been made to develop fast neural rendering approaches for static scenes, these methods cannot simply be applied to dynamic facial performances, where realistic facial expressions must be supported. To address these challenges, we propose a novel fast 3D neural implicit head avatar model that achieves real-time rendering while maintaining fine-grained controllability and high rendering quality. Our key idea lies in the introduction of local hash table blendshapes, which are learned and attached to the vertices of an underlying face parametric model. These per-vertex hash tables are linearly merged with weights predicted via a CNN, resulting in expression-dependent embeddings. Our novel representation enables efficient density and color predictions using a lightweight MLP, which is further accelerated by a hierarchical nearest neighbor search method. Extensive experiments show that our approach runs in real time while achieving rendering quality comparable to state-of-the-art methods and decent results on challenging expressions.
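
To make the linear merging step concrete, here is a minimal NumPy sketch of combining several hash tables per vertex into one expression-dependent table. The array layout, the toy sizes, and the name `merge_hash_blendshapes` are assumptions for illustration; the weights here are random stand-ins for what the abstract describes as CNN-predicted weights.

```python
import numpy as np

def merge_hash_blendshapes(tables, weights):
    """Linearly combine the hash table blendshapes attached to each vertex.

    tables:  (V, B, T, F) array, B small hash tables per vertex,
             each with T entries of F features (assumed layout).
    weights: (V, B) array of per-vertex blend weights.
    returns: (V, T, F) array, one merged table per vertex.
    """
    return np.einsum("vb,vbtf->vtf", weights, tables)

rng = np.random.default_rng(0)
V, B, T, F = 4, 5, 16, 2  # toy sizes, chosen only for this example
tables = rng.normal(size=(V, B, T, F))
weights = rng.normal(size=(V, B))  # stand-in for CNN output

merged = merge_hash_blendshapes(tables, weights)
```

Because the merge is a weighted sum over the blendshape axis, it runs once per expression rather than once per query point, which is consistent with the abstract's claim that the per-point decoder can stay lightweight.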

Paper Structure (31 sections, 5 equations, 10 figures, 5 tables)

Introduction
Related Work
Method
Mesh-anchored Hash Table Blendshapes
Merge Mesh-anchored Blendshapes
Neural Radiance Field Decoding
Hierarchical $k$-nearest-neighbor Search
Training
Experiments
Data and Metrics
Data.
Metrics.
Comparison to State-of-the-art
Ablation Study
Mesh-anchored Hash Table Blendshapes
...and 16 more sections

Figures (10)

Figure 1: Visual quality vs. rendering speed comparison between our monocular 3D head avatars and prior state-of-the-art methods, including NeRFBlendshape Gao2022nerfblendshape, INSTA zielonka2023instant, PointAvatar zheng2023pointavatar, and MonoAvatar bai2023learning. Our approach achieves real-time rendering (i.e., $> 30$ mean FPS) at $512 \times 512$ resolution, while producing visual quality comparable to the prior SOTA bai2023learning and significantly better results on challenging expressions than prior efficient avatars Gao2022nerfblendshape, zielonka2023instant, zheng2023pointavatar. Webpage: https://augmentedperception.github.io/monoavatar-plus/
Figure 2: Overview of our pipeline. Our core avatar representation is Mesh-anchored Hash Table Blendshapes (see the section "Mesh-anchored Hash Table Blendshapes"), where multiple small hash tables are attached to each 3DMM vertex. During inference, our method starts from a displacement map encoding the facial expression, which is then fed into a U-Net to predict hash table weights and per-vertex features. The predicted weights are used to linearly combine the multiple hash tables attached to each 3DMM vertex (see "Merge Mesh-anchored Blendshapes"). During volumetric rendering (see "Neural Radiance Field Decoding" and Figure 3), for each query point, we search for its $k$-nearest-neighbor vertices, then pull embeddings from the merged hash tables and concatenate them with the per-vertex features to decode local density and color via a tiny MLP with two hidden layers.
Figure 3: Pipeline of Neural Radiance Field Decoding (see the section "Neural Radiance Field Decoding"). Our method only needs a very lightweight MLP for NeRF inference, where a two-hidden-layer MLP is used to predict the density and the color of each query point.
Figure 4: Comparison of rendering quality with previous state-of-the-art methods. From left to right, the columns show: 1) PointAvatar zheng2023pointavatar, 2) INSTA zielonka2023instant, 3) NeRFBlendshape Gao2022nerfblendshape, 4) MonoAvatar bai2023learning, 5) Ours, 6) Ground Truth. Our method faithfully reconstructs the personalized expressions and high-frequency details, achieving rendering quality among the best of the compared methods at real-time rendering speed.
Figure 5: Qualitative results of our full model and various alternative design choices to demonstrate the necessity of our mesh-anchored hash table blendshapes. Please refer to the "Mesh-anchored Hash Table Blendshapes" subsection of the Ablation Study for details of each alternative design choice. "Static Hash + 3DMM Param" gives unfaithful expressions and obvious artifacts around eyes and ears. "Static Hash + UV CNN" produces more faithful expressions, but still suffers from floaters. By utilizing the blendshape formulation of hash tables, "3 Blendshapes" eliminates most of the artifacts, but produces blurriness on the details of eyes and ears. "Ours" gives the best rendering quality by increasing the number of blendshapes to $5$, leading to cleaner details of the eyelids and ear boundary.
...and 5 more figures
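
The per-point decoding described in the Figure 2 and Figure 3 captions can be sketched as follows: find the $k$ nearest vertices, pull an embedding from each vertex's merged hash table, concatenate it with that vertex's feature, and run a two-hidden-layer MLP. Several details are assumptions made only to keep the sketch runnable: the spatial hash (`toy_hash`, with an arbitrary cell size), averaging over the $k$ neighbors, brute-force neighbor search, random weights, and the softplus and sigmoid output activations.

```python
import numpy as np

def toy_hash(local_pos, table_size, cell=0.05):
    """Hypothetical spatial hash: quantize a local offset, mix with primes."""
    gx, gy, gz = (int(c) for c in np.floor(local_pos / cell))
    return (gx ^ (gy * 2654435761) ^ (gz * 805459861)) % table_size

def decode_point(x, vertices, merged, vert_feat, mlp, k=3):
    """Decode density and color for one query point x of shape (3,)."""
    # Brute-force k-NN here; the paper accelerates this step hierarchically.
    nearest = np.argsort(np.linalg.norm(vertices - x, axis=1))[:k]
    feats = []
    for v in nearest:
        slot = toy_hash(x - vertices[v], merged.shape[1])
        # Embedding from the merged table, concatenated with the vertex feature.
        feats.append(np.concatenate([merged[v, slot], vert_feat[v]]))
    h = np.mean(feats, axis=0)  # assumption: plain average over neighbors
    W1, b1, W2, b2, W3, b3 = mlp
    h = np.maximum(h @ W1 + b1, 0.0)  # hidden layer 1 (ReLU)
    h = np.maximum(h @ W2 + b2, 0.0)  # hidden layer 2 (ReLU)
    out = h @ W3 + b3
    density = np.log1p(np.exp(out[0]))      # softplus keeps density non-negative
    color = 1.0 / (1.0 + np.exp(-out[1:]))  # sigmoid maps RGB into [0, 1]
    return density, color

rng = np.random.default_rng(0)
V, T, F, C, H = 50, 16, 2, 4, 16  # toy sizes, chosen only for this example
vertices = rng.uniform(-1.0, 1.0, size=(V, 3))
merged = rng.normal(size=(V, T, F))   # merged per-vertex hash tables
vert_feat = rng.normal(size=(V, C))   # per-vertex features (U-Net output in the paper)
mlp = (rng.normal(size=(F + C, H)) * 0.1, np.zeros(H),
       rng.normal(size=(H, H)) * 0.1, np.zeros(H),
       rng.normal(size=(H, 4)) * 0.1, np.zeros(4))

density, color = decode_point(np.array([0.1, -0.2, 0.3]), vertices, merged, vert_feat, mlp)
```

The network input is only `F + C` values wide in this sketch, which illustrates why a tiny MLP can suffice once the expression-dependent information lives in the merged tables and per-vertex features.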
