DICE: End-to-end Deformation Capture of Hand-Face Interactions from a Single Image

Qingxuan Wu, Zhiyang Dou, Sirui Xu, Soshi Shimada, Chen Wang, Zhengming Yu, Yuan Liu, Cheng Lin, Zeyu Cao, Taku Komura, Vladislav Golyanik, Christian Theobalt, Wenping Wang, Lingjie Liu

TL;DR

DICE tackles monocular hand-face deformation reconstruction by introducing an end-to-end Transformer-based framework that jointly estimates hand/face poses, contacts, and deformations from a single image. It uses a two-branch architecture (MeshNet for global geometry and InteractionNet for local deformations and contacts) plus an IKNet to map non-parametric mesh outputs to animatable hand/face parameters, enabling efficient, plausible mesh recovery. A weakly-supervised training pipeline augments studio data with 2D keypoints, diffusion-based depth priors, and adversarial pose priors to improve generalization to in-the-wild images. Experiments show state-of-the-art accuracy, robust physical plausibility, and interactive speed (20 fps) on standard benchmarks and real-world images, highlighting strong potential for AR/VR and animation pipelines. The method also provides an animatable parametric representation suitable for downstream applications, with clear paths for scaling and tackling remaining occlusion-related challenges.
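To make the two-branch design concrete, here is a minimal PyTorch-style sketch of the pipeline described above. Module names, token layout, and dimensions are illustrative assumptions for exposition, not the authors' implementation; in particular, the per-vertex query tokens, the stand-in CNN backbone, and the placeholder IKNet head are hypothetical.

```python
import torch
import torch.nn as nn

N_FACE, N_HAND = 5023, 778  # FLAME / MANO vertex counts (a coarse
                            # subsampling would likely be used in practice)
D = 256                     # token dimension (assumption)

def encoder(num_layers=4):
    layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

class DICESketch(nn.Module):
    """Two branches over shared image tokens: MeshNet regresses global
    vertex locations; InteractionNet regresses per-vertex contact
    probabilities and a face deformation field."""
    def __init__(self):
        super().__init__()
        # Stand-in CNN backbone: image -> patch feature tokens.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, D, kernel_size=16, stride=16), nn.Flatten(2))
        self.queries = nn.Parameter(torch.randn(N_FACE + N_HAND, D))
        self.mesh_net, self.interaction_net = encoder(), encoder()
        self.vert_head = nn.Linear(D, 3)      # global geometry branch
        self.contact_head = nn.Linear(D, 1)   # interaction branch
        self.deform_head = nn.Linear(D, 3)
        # Placeholder IKNet: maps pooled mesh features to FLAME/MANO
        # pose-shape parameters (parameter count is a stand-in).
        self.iknet = nn.Linear(D, 156)

    def forward(self, img):                        # img: (B, 3, 224, 224)
        feat = self.backbone(img).transpose(1, 2)  # (B, 196, D)
        q = self.queries.expand(img.shape[0], -1, -1)
        tokens = torch.cat([q, feat], dim=1)
        nv = N_FACE + N_HAND
        mesh_tok = self.mesh_net(tokens)[:, :nv]
        inter_tok = self.interaction_net(tokens)[:, :nv]
        verts = self.vert_head(mesh_tok)                    # (B, nv, 3)
        contact = torch.sigmoid(self.contact_head(inter_tok))
        deform = self.deform_head(inter_tok[:, :N_FACE])    # face-only
        params = self.iknet(mesh_tok.mean(dim=1))           # animatable
        return verts, contact, deform, params
```

In this reading, the deformation field predicted by the interaction branch would be added to the FLAME-posed face vertices, keeping global geometry and local deformation disentangled.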

Abstract

Reconstructing 3D hand-face interactions with deformations from a single image is a challenging yet crucial task with broad applications in AR, VR, and gaming. The challenges stem from self-occlusions during single-view hand-face interactions, diverse spatial relationships between hands and face, complex deformations, and the ambiguity of the single-view setting. The first and only method for hand-face interaction recovery, Decaf, introduces a global fitting optimization guided by contact and deformation estimation networks trained on studio-collected data with 3D annotations. However, Decaf suffers from a time-consuming optimization process and limited generalization capability due to its reliance on 3D annotations of hand-face interaction data. To address these issues, we present DICE, the first end-to-end method for Deformation-aware hand-face Interaction reCovEry from a single image. DICE estimates the poses of hands and faces, contacts, and deformations simultaneously using a Transformer-based architecture. It disentangles the regression of local deformation fields and global mesh vertex locations into two network branches, enhancing deformation and contact estimation for precise and robust hand-face mesh recovery. To improve generalizability, we propose a weakly-supervised training approach that augments the training set with in-the-wild images lacking 3D ground-truth annotations, using keypoint depths estimated by off-the-shelf models and adversarial pose priors for supervision. Our experiments demonstrate that DICE achieves state-of-the-art performance on a standard benchmark and in-the-wild data in terms of accuracy and physical plausibility. Additionally, our method operates at an interactive rate (20 fps) on an NVIDIA RTX 4090 GPU, whereas Decaf requires more than 15 seconds for a single image. Our code will be publicly available upon publication.
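A hedged sketch of how the weakly-supervised objective on unlabeled in-the-wild images could be assembled, again in PyTorch. The function name, the median-based alignment of the detector depths, and the LSGAN-style generator term are illustrative assumptions; the paper's exact loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def weak_supervision_loss(pred_kp3d, kp2d_det, depth_det, pose_params,
                          disc, cam_f=1000.0, w_kp=1.0, w_d=0.1, w_adv=0.01):
    """pred_kp3d: (B, K, 3) predicted 3D keypoints in camera space.
    kp2d_det:    (B, K, 2) keypoints from off-the-shelf 2D detectors.
    depth_det:   (B, K) per-keypoint depths sampled from a monocular
                 depth estimator (relative scale, treated as a prior).
    pose_params: regressed FLAME/MANO parameters from the IKNets.
    disc:        pose discriminator (callable). Weights are assumptions."""
    # 2D reprojection term: pinhole projection (principal point omitted).
    proj = cam_f * pred_kp3d[..., :2] / pred_kp3d[..., 2:3].clamp(min=1e-6)
    loss_kp = F.l1_loss(proj, kp2d_det)
    # Keypoint-depth term: align the detector's relative depths to the
    # prediction by matching medians (a simple stand-in for a proper
    # scale/shift least-squares alignment).
    pred_z = pred_kp3d[..., 2]
    aligned = (depth_det - depth_det.median(dim=1, keepdim=True).values
               + pred_z.detach().median(dim=1, keepdim=True).values)
    loss_depth = F.l1_loss(pred_z, aligned)
    # Adversarial pose prior: LSGAN-style generator term pulling the
    # regressed parameters toward the realistic-pose manifold.
    loss_adv = ((disc(pose_params) - 1.0) ** 2).mean()
    return w_kp * loss_kp + w_d * loss_depth + w_adv * loss_adv
```

Combined with the fully-supervised losses on the studio-captured data, a term like this would let unlabeled images contribute gradient signal without any 3D ground truth.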

Paper Structure

This paper contains 30 sections, 19 equations, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: Our method is the first end-to-end approach that captures hand-face interaction and deformation from a monocular image. Results are from (a) Decaf's validation dataset, (b) in-the-wild images, and (c) VR use cases.
  • Figure 2: Overview of the proposed DICE framework. The input image is first fed to a CNN to extract a feature map, which is then passed to the Transformer-based encoders for mesh and interaction, i.e., MeshNet and InteractionNet. MeshNet extracts hand and face mesh features, which are then used by the Inverse Kinematics models (IKNets) to predict pose and shape parameters that drive the FLAME [li2017learning] and MANO [romero2022embodied] models. InteractionNet predicts per-vertex hand-face contact probabilities and face deformation fields from the feature map, where the latter is applied to the face mesh output by the FLAME model. To improve the generalization capability, we introduce a weakly-supervised training scheme using off-the-shelf 2D keypoint detection models [lugaresi2019mediapipe, bulat2017far] and a depth estimation model [ke2023repurposing] to provide depth supervision on keypoints. In addition, we use face and hand discriminators to constrain the distribution of parameters regressed by the IKNets.
  • Figure 3: Qualitative results of hand-face interaction, deformation, and contact recovery by DICE on Decaf and in-the-wild images. In contact visualizations, a deeper color indicates a higher contact probability.
  • Figure 4: Qualitative comparison of DICE, Decaf [shimada2023decaf], PIXIE [feng2021collaborative] (whole-body version), and METRO* [lin2021mesh] on the Decaf validation set and in-the-wild images. Our method achieves superior reconstruction accuracy and plausibility on the Decaf [shimada2023decaf] dataset and, unlike all baselines, generalizes well to difficult in-the-wild actions unseen in Decaf.
  • Figure 5: Structural details of the MeshNet and InteractionNet. (a) MeshNet; (b) InteractionNet; (c) Internal structure of a Transformer Encoder block.
  • ...and 4 more figures