MI-NeRF: Learning a Single Face NeRF from Multiple Identities

Aggelina Chatziagapi, Grigorios G. Chrysos, Dimitris Samaras

TL;DR

This work tackles multi-identity dynamic NeRFs for talking faces by training a single network across identities. It introduces a multiplicative interaction module that captures the non-linear coupling between identity and expression, which enables disentanglement and robust synthesis of novel expressions. Training one model on 100 identities cuts total training time by roughly 90% relative to per-identity NeRFs, and the model can be personalized for a target identity from a short video clip. Results cover facial expression transfer and lip-synced video synthesis across identities. Extending the approach to thousands of identities would require further research into multi-identity generalization.

Abstract

In this work, we introduce a method that learns a single dynamic neural radiance field (NeRF) from monocular talking face videos of multiple identities. NeRFs have shown remarkable results in modeling the 4D dynamics and appearance of human faces. However, they require per-identity optimization. Although recent approaches have proposed techniques to reduce the training and rendering time, increasing the number of identities can be expensive. We introduce MI-NeRF (multi-identity NeRF), a single unified network that models complex non-rigid facial motion for multiple identities, using only monocular videos of arbitrary length. The core premise in our method is to learn the non-linear interactions between identity and non-identity specific information with a multiplicative module. By training on multiple videos simultaneously, MI-NeRF not only reduces the total training time compared to standard single-identity NeRFs, but also demonstrates robustness in synthesizing novel expressions for any input identity. We present results for both facial expression transfer and talking face video synthesis. Our method can be further personalized for a target identity given only a short video.

Paper Structure (26 sections, 3 theorems, 14 equations, 15 figures, 5 tables)

Key Result

Proposition 1

The function $M$ of eq:multA captures multiplicative interactions.
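
As a rough illustration of what "captures multiplicative interactions" means, the following Python sketch contrasts a Hadamard-product module with a purely linear layer on concatenated inputs. The specific form of `multiplicative_module`, the weight names `U`, `V`, `C`, `W`, and all dimensions are assumptions made for illustration; they are not taken from the paper's equation for $M$. The baseline is also simplified to a single linear layer, whereas the paper's baseline is a full NeRF that concatenates its input conditions.

```python
import numpy as np

rng = np.random.default_rng(0)
D_EXPR, D_ID, D_HID, D_OUT = 8, 4, 16, 6  # hypothetical sizes

# Hypothetical weights (random here; in practice they would be learned).
U = rng.standard_normal((D_HID, D_EXPR))
V = rng.standard_normal((D_HID, D_ID))
C = rng.standard_normal((D_OUT, D_HID))
W = rng.standard_normal((D_OUT, D_EXPR + D_ID))

def multiplicative_module(e, i):
    """Hadamard-product interaction plus a linear term (illustrative form only)."""
    return C @ ((U @ e) * (V @ i)) + W @ np.concatenate([e, i])

def concat_baseline(e, i):
    """One linear layer on the concatenated inputs: no cross terms."""
    return W @ np.concatenate([e, i])

def cross_difference(f, e0, e1, i0, i1):
    """Zero whenever f(e, i) can be written as g(e) + h(i)."""
    return f(e1, i1) - f(e1, i0) - f(e0, i1) + f(e0, i0)

# Two expression vectors and two identity codes (random stand-ins).
e0, e1 = rng.standard_normal(D_EXPR), rng.standard_normal(D_EXPR)
i0, i1 = rng.standard_normal(D_ID), rng.standard_normal(D_ID)

mult_gap = np.linalg.norm(cross_difference(multiplicative_module, e0, e1, i0, i1))
base_gap = np.linalg.norm(cross_difference(concat_baseline, e0, e1, i0, i1))
print(f"multiplicative module: {mult_gap:.3e}")
print(f"linear concat baseline: {base_gap:.3e}")
```

For the additive baseline the cross difference is zero up to floating-point error, while the Hadamard term leaves a clearly non-zero residual. That residual is the identity-expression cross term that a purely additive model cannot represent.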

Figures (15)

  • Figure 1: Overview of MI-NeRF. Given monocular talking face videos from multiple identities, MI-NeRF learns a single network to model their 4D geometry and appearance. A multiplicative module with shared weights across all identities learns non-linear interactions between identity codes and facial expressions. MI-NeRF can synthesize high-quality videos of any input identity.
  • Figure 2: Ablation Study. Qualitative comparison of MI-NeRF with Baseline NeRF that concatenates all input conditions, without using any multiplicative module, and leads to poor disentanglement. Our proposed multiplicative module demonstrates robustness, disentangling between identity and expression.
  • Figure 3: Transferring Novel Expressions. Qualitative comparison of MI-NeRF with state-of-the-art approaches when transferring unseen expressions to a target identity. NeRFace [nerface] is a single-identity NeRF, INSTA [zielonka2023insta] is a single-identity geometry-guided deformable NeRF, and HeadNeRF [hong2022headnerf] is a NeRF-based parametric head model trained on a large dataset. Our method demonstrates robustness in synthesizing novel (unseen) expressions for any input identity.
  • Figure 4: Left: Total training time vs. total number of identities. Standard single-identity NeRFs, like NeRFace [nerface], AD-NeRF [adnerf], and LipNeRF [lipnerf], require approximately 40 hours of training per identity. In contrast, our MI-NeRF (generic) can be trained on 100 identities in 80 hours, an approximately $90\%$ decrease. Further personalization takes another 5-8 hours per identity. Right: Corresponding visual quality of generated videos with challenging novel expressions, measured by PSNR (higher is better). Increasing the number of identities improves the robustness of our model to unseen expressions.
  • Figure 5: Lip-Synced Video Synthesis. Qualitative comparison of our method with state-of-the-art approaches: GAN-based Wav2Lip [wav2lip], AD-NeRF [adnerf], and LipNeRF [lipnerf]. The original video is in English (1st column). The generated videos (columns 2-5) are lip-synced to dubbed audio in Spanish.
  • ...and 10 more figures
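
The training-time numbers quoted in the Figure 4 caption can be plugged into a short calculation. The sketch below is only back-of-the-envelope arithmetic on those quoted values; the two accounting scenarios (generic model only, and generic model plus personalization of every identity) and all variable names are assumptions, since the caption does not spell out how its approximate 90% figure is computed.

```python
# Values quoted in the Figure 4 caption.
HOURS_PER_SINGLE_IDENTITY = 40      # standard single-identity NeRF
GENERIC_HOURS = 80                  # MI-NeRF (generic) on 100 identities
N_IDENTITIES = 100
PERSONALIZATION_HOURS = (5, 8)      # extra hours per identity (low, high)

def saving(baseline_hours, new_hours):
    """Fractional reduction in total training time."""
    return 1.0 - new_hours / baseline_hours

# Baseline: train one single-identity NeRF per identity.
baseline_total = HOURS_PER_SINGLE_IDENTITY * N_IDENTITIES

# Scenario A (assumed): only the generic multi-identity model is trained.
generic_saving = saving(baseline_total, GENERIC_HOURS)

# Scenario B (assumed): every identity is additionally personalized.
personalized_savings = [
    saving(baseline_total, GENERIC_HOURS + N_IDENTITIES * h)
    for h in PERSONALIZATION_HOURS
]

print(f"baseline total: {baseline_total} h")
print(f"generic only: {generic_saving:.1%} saved")
print("with personalization: "
      + ", ".join(f"{s:.1%}" for s in personalized_savings) + " saved")
```

Under these assumptions the saving is larger when only the generic model is counted and smaller when all 100 identities are also personalized; the caption's approximate figure lies between the two readings.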

Theorems & Definitions (5)

  • Proposition 1
  • Proposition 2
  • proof
  • Proposition 3
  • proof