TalkinNeRF: Animatable Neural Fields for Full-Body Talking Humans

Aggelina Chatziagapi, Bindita Chaudhuri, Amit Kumar, Rakesh Ranjan, Dimitris Samaras, Nikolaos Sarafianos

TL;DR

TalkinNeRF presents a holistic 4D dynamic NeRF for full-body talking humans learned from monocular frontal videos. By conditioning body, hands, and face modules on per-frame pose and expression parameters and introducing a learnable hand deformation field, it achieves high-fidelity animation across multiple identities and unseen poses. The approach leverages a multi-identity code to enable joint training across subjects, dramatically speeding up training and improving robustness, while adapting to new identities from short videos. This yields state-of-the-art results in facial expressions, hand articulation, and lip-sync for full-body talking humans, with practical implications for AR/VR, virtual communication, and media production.

Abstract

We introduce a novel framework that learns a dynamic neural radiance field (NeRF) for full-body talking humans from monocular videos. Prior work represents only the body pose or the face. However, humans communicate with their full body, combining body pose, hand gestures, and facial expressions. In this work, we propose TalkinNeRF, a unified NeRF-based network that represents the holistic 4D human motion. Given a monocular video of a subject, we learn corresponding modules for the body, face, and hands, which are combined to generate the final result. To capture complex finger articulation, we learn an additional deformation field for the hands. Our multi-identity representation enables simultaneous training for multiple subjects, as well as robust animation under completely unseen poses. It can also generalize to novel identities, given only a short video as input. We demonstrate state-of-the-art performance for animating full-body talking humans, with fine-grained hand articulation and facial expressions.

Paper Structure (14 sections, 10 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Overview of TalkinNeRF. Given a monocular video of a subject, we learn a unified NeRF-based network that represents their holistic 4D motion. Corresponding modules for body, face, and hands are combined together, in order to synthesize the final full-body talking human. By learning an identity code per video, our method can be trained on multiple identities simultaneously.
  • Figure 2: (a) Ablation study on the segmentation classes. Our method predicts 5 classes: head, hands, arms, body, background. If the arms are considered part of the body (4 classes, "W/o arms"), we observe disconnected arms in the reconstruction results. (b) Ablation study on the hand representation when rendering novel (unseen) poses. We compare three variants: without learning a hand deformation field $D_{\text{hands}}$; without considering the output of the Body MLP as background for the hands ("W/o background"); and ours (with $D_{\text{hands}}$ and background).
  • Figure 3: Qualitative comparison for rendering novel poses from the same identity. We compare with HumanNeRF [humannerf] and MonoHuman [yu2023monohuman]. Ground truth (not seen in training) is shown on the left. Our method generates facial expressions and hand articulation with high fidelity.
  • Figure 4: Qualitative comparison for rendering novel poses from a different identity. From left to right: target pose, results of HumanNeRF [humannerf], MonoHuman [yu2023monohuman], our single-identity model, and our multi-identity model. Our multi-identity TalkinNeRF robustly renders each identity under unseen poses and expressions.
  • Figure 5: Qualitative comparison for rendering unseen out-of-distribution poses. From left to right: target pose, results of HumanNeRF [humannerf], MonoHuman [yu2023monohuman], our single-identity model, and our multi-identity model. Our multi-identity TalkinNeRF robustly renders each identity under completely unseen poses and expressions.
  • ...and 7 more figures
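
The captions for Figures 1 and 2 mention two ideas that a small sketch may make more concrete: an identity code learned per video, and the use of the Body MLP output as background for the hands. The Python below is only an illustration under stated assumptions, not the authors' implementation. It assumes that the identity code is concatenated with the per-frame pose parameters, and that the hand colour is blended over the body colour with a standard alpha "over" operation. The names `composite_over` and `build_condition` are hypothetical and do not come from the paper.

```python
from typing import Dict, List, Sequence, Tuple

# The five segmentation classes listed in the Figure 2 caption.
SEG_CLASSES = ("head", "hands", "arms", "body", "background")

RGB = Tuple[float, ...]


def composite_over(fg_rgb: Sequence[float], fg_alpha: float,
                   bg_rgb: Sequence[float]) -> RGB:
    """Blend a foreground colour over a background colour.

    ASSUMPTION: a standard alpha "over" blend. The source only says the
    Body MLP output is considered as background for the hands; the exact
    blending rule is not given there.
    """
    if not 0.0 <= fg_alpha <= 1.0:
        raise ValueError("fg_alpha must lie in [0, 1]")
    return tuple(fg_alpha * f + (1.0 - fg_alpha) * b
                 for f, b in zip(fg_rgb, bg_rgb))


def build_condition(identity_codes: Dict[str, Sequence[float]],
                    video_id: str,
                    pose_params: Sequence[float]) -> List[float]:
    """Form a conditioning vector for one frame of one video.

    ASSUMPTION: the per-video identity code is simply concatenated with
    the per-frame pose parameters. This is a common choice, not a detail
    confirmed by the source.
    """
    return list(identity_codes[video_id]) + list(pose_params)


if __name__ == "__main__":
    # Hypothetical values for a single pixel and a single frame.
    hand_rgb, hand_alpha = (1.0, 0.8, 0.6), 0.75
    body_rgb = (0.2, 0.2, 0.2)
    print(composite_over(hand_rgb, hand_alpha, body_rgb))

    codes = {"video_a": [0.1, 0.2], "video_b": [0.3, 0.4]}
    print(build_condition(codes, "video_b", [0.5, 0.6, 0.7]))
```

In this sketch, a hand sample with low opacity lets the body colour show through, which is one plausible reading of why dropping the background ("W/o background") degrades the hand renderings in the ablation.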