XHand: Real-time Expressive Hand Avatar

Qijun Gan; Zijie Zhou; Jianke Zhu

TL;DR

XHand tackles the problem of real-time rendering of expressive, high-fidelity hand avatars. It introduces three feature embedding modules to predict vertex displacements, albedo, and LBS weights on a subdivided MANO template, coupled with a mesh-based neural renderer and a part-aware Laplace smoothing regularizer. The approach achieves state-of-the-art rendering quality and geometry fidelity on InterHand2.6M and DeepHandMesh, running at real-time speeds and outperforming prior volumetric and mesh-based methods. This work advances immersive XR and gaming capabilities by delivering detailed, pose-consistent hand avatars with efficient, end-to-end training and rendering pipelines.

Abstract

Hand avatars play a pivotal role in a wide array of digital interfaces, enhancing user immersion and facilitating natural interaction within virtual environments. While previous studies have focused on photo-realistic hand rendering, little attention has been paid to reconstructing the hand geometry with fine details, which is essential to rendering quality. In extended reality and gaming, on-the-fly rendering is imperative. To this end, we introduce an expressive hand avatar, named XHand, that is designed to comprehensively generate hand shape, appearance, and deformations in real-time. To obtain fine-grained hand meshes, we use three feature embedding modules to predict hand deformation displacements, albedo, and linear blending skinning weights, respectively. To achieve photo-realistic hand rendering on fine-grained meshes, our method employs a mesh-based neural renderer that leverages mesh topological consistency and latent codes from the embedding modules. During training, we propose a part-aware Laplace smoothing strategy that applies distinct levels of regularization to different hand parts, preserving the necessary details while eliminating undesired artifacts. Experimental evaluations on the InterHand2.6M and DeepHandMesh datasets demonstrate the efficacy of XHand, which recovers high-fidelity geometry and texture for hand animations across diverse poses in real-time. To reproduce our results, we will make the full implementation publicly available at https://github.com/agnJason/XHand.
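As an illustration of what a part-aware Laplace smoothing term could look like, the sketch below weights a uniform mesh Laplacian penalty by a per-part coefficient. This is only one plausible reading of the abstract, not the paper's formulation: the function names, the uniform (unweighted) Laplacian, and the `part_ids` / `part_weights` inputs are all assumptions introduced here.

```python
import numpy as np

def uniform_laplacian(values, faces):
    """Per-vertex value minus the mean value of its mesh neighbours."""
    n = values.shape[0]
    neighbours = [set() for _ in range(n)]
    for a, b, c in faces:
        neighbours[a].update((b, c))
        neighbours[b].update((a, c))
        neighbours[c].update((a, b))
    lap = np.zeros_like(values, dtype=float)
    for i, nbrs in enumerate(neighbours):
        if nbrs:  # isolated vertices keep a zero Laplacian
            lap[i] = values[i] - values[list(nbrs)].mean(axis=0)
    return lap

def part_aware_laplacian_loss(values, faces, part_ids, part_weights):
    """Hypothetical regularizer: squared Laplacian magnitude per vertex,
    scaled by the weight of the hand part that vertex belongs to."""
    lap = uniform_laplacian(values, faces)
    per_vertex = (lap ** 2).sum(axis=1)
    weights = part_weights[part_ids]  # look up one weight per vertex
    return float((weights * per_vertex).mean())
```

In such a scheme, a part given a small weight is free to keep sharper detail, while a part given a large weight is smoothed more strongly; `values` could be per-vertex displacements or albedo.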

Paper Structure (16 sections, 19 equations, 10 figures, 5 tables)

Introduction
Related Work
Parametric Model-based Method
Model-free Approach
Neural Hand Representation
Generic Animatable Objects
Method
Detailed Hand Representation
Mesh Rendering
Training Process
Experiments
Datasets
Experimental Setup
Experimental Results
Ablation Study
...and 1 more section

Figures (10)

Figure 1: We present XHand, a rigged hand avatar that captures the geometry, appearance and poses of the hand. XHand is created from multi-view videos and utilizes MANO pose parameters (the first image in each group of (a)) to generate high-detail meshes (the second) and renderings (the third). XHand generates photo-realistic hand images in real-time for a given pose sequence (b). (c) shows personalized hand avatars animated with poses obtained from in-the-wild images [pavlakos2023reconstructing].
Figure 2: Overview of XHand. Given a hand pose $\theta$, XHand utilizes three feature embedding modules to obtain the displacement field $D$, Linear Blending Skinning (LBS) weights $W$, and albedo $\rho$. These features are applied to the subdivided MANO template $\bar{\mathcal{M}}'$, resulting in a detailed geometric hand mesh. Leveraging mesh-based neural rendering, we achieve photo-realistic renderings. XHand can generate both detailed geometry and realistic rendering in real-time.
Figure 3: The comparison of mesh refinement and texture between the original MANO and the subdivided MANO.
Figure 4: Our proposed feature embedding module. Pose $\theta_t$ at time $t$ is decoded by pose decoder $\Phi$ along with latent code $Q$ for each vertex and mapped to the feature space through mapping matrix $\mathcal{K}$. Finally, it is combined with the average vertex feature $\bar{f}_\mathcal{M}$ to obtain the feature for the pose $\theta$. We introduce three different feature embedding modules $\Psi_{lbs}$, $\Psi_{D}$ and $\Psi_{\rho}$ to predict LBS weights $W'$, displacements $D$ and albedo $\rho$.
Figure 5: The structure of the neural renderer $\mathcal{C}$, where $*$ denotes positional encoding [aliev2020neural].
...and 5 more figures
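
The geometry step described in the Figure 2 caption (displacements $D$ and LBS weights $W$ applied to the subdivided template) can be sketched with standard linear blend skinning. This is a minimal sketch under stated assumptions: the function name `skin_vertices` is hypothetical, the displacement is assumed to be added in the rest pose before skinning, and per-bone transforms are assumed to be given as 4x4 matrices. The paper's actual ordering and parameterization may differ.

```python
import numpy as np

def skin_vertices(template, displacements, lbs_weights, bone_transforms):
    """Standard linear blend skinning on a displaced template.

    template:        (V, 3) rest-pose vertices of the subdivided mesh
    displacements:   (V, 3) per-vertex offsets (assumed rest-pose space)
    lbs_weights:     (V, J) skinning weights, each row summing to 1
    bone_transforms: (J, 4, 4) rigid transform of each bone
    """
    rest = template + displacements
    ones = np.ones((rest.shape[0], 1))
    rest_h = np.concatenate([rest, ones], axis=1)  # homogeneous coords
    # Blend the bone transforms per vertex, then apply them.
    blended = np.einsum("vj,jab->vab", lbs_weights, bone_transforms)
    posed_h = np.einsum("vab,vb->va", blended, rest_h)
    return posed_h[:, :3]
```

Because the mesh topology is fixed, the same `lbs_weights` and `displacements` arrays index the same vertices for every pose, which is what makes per-vertex feature prediction straightforward in this kind of pipeline.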
