Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition

Jiaxiang Tang, Kaisiyuan Wang, Hang Zhou, Xiaokang Chen, Dongliang He, Tianshu Hu, Jingtuo Liu, Gang Zeng, Jingdong Wang

TL;DR

This paper tackles the slow training and inference speed of NeRF-based talking portrait synthesis by introducing Real-time Audio-spatial Decomposed NeRF (RAD-NeRF). It decomposes the high-dimensional audio-spatial representation of the head into two low-dimensional feature grids and models the torso with a lightweight Pseudo-3D Deformable Module built on a 2D grid, enabling real-time inference while maintaining rendering quality. Key contributions include the Decomposed Audio-spatial Encoding Module, the Pseudo-3D Deformable Module, maximum occupancy grid pruning, and targeted losses for lip fine-tuning and eye control. The method achieves about $40$ FPS and substantially faster convergence than prior work, with strong quantitative and qualitative results and explicit control over head pose, eyes, and background. The approach is relevant to telepresence, digital humans, and multimedia applications; the paper also acknowledges ethical considerations and offers insights for deepfake detection and mitigation. In summary, the work advances efficient dynamic NeRF for audio-driven portraits by combining a decomposed audio-spatial representation with a lightweight torso model, delivering real-time, high-quality, controllable talking portraits.

Abstract

While dynamic Neural Radiance Fields (NeRF) have shown success in high-fidelity 3D modeling of talking portraits, the slow training and inference speed severely obstruct their potential usage. In this paper, we propose an efficient NeRF-based framework that enables real-time synthesizing of talking portraits and faster convergence by leveraging the recent success of grid-based NeRF. Our key insight is to decompose the inherently high-dimensional talking portrait representation into three low-dimensional feature grids. Specifically, a Decomposed Audio-spatial Encoding Module models the dynamic head with a 3D spatial grid and a 2D audio grid. The torso is handled with another 2D grid in a lightweight Pseudo-3D Deformable Module. Both modules focus on efficiency under the premise of good rendering quality. Extensive experiments demonstrate that our method can generate realistic and audio-lips synchronized talking portrait videos, while also being highly efficient compared to previous methods.
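
To make the benefit of the decomposition concrete, the following is a minimal arithmetic sketch, not the authors' implementation. It assumes that the alternative to the decomposition would be a single joint grid over the 3D spatial coordinate and the 2D audio coordinate (5 dimensions in total), and it uses dense grids with an illustrative resolution of 64 per axis; the function names and the resolution are hypothetical. It relies only on the fact that multilinear interpolation in a $D$-dimensional grid reads $2^D$ corner values per lookup.

```python
# Hypothetical cost comparison: one joint 5D grid versus a 3D spatial grid
# plus a 2D audio grid. Dense grids and RESOLUTION are illustrative assumptions.

SPATIAL_DIMS = 3   # spatial coordinate x
AUDIO_DIMS = 2     # compressed audio coordinate x_a
RESOLUTION = 64    # assumed cells per axis, for illustration only


def corners_per_lookup(dims: int) -> int:
    """Number of grid corners read by multilinear interpolation in `dims` dimensions."""
    return 2 ** dims


def dense_cells(resolution: int, dims: int) -> int:
    """Number of cells in a dense grid with `resolution` cells along each axis."""
    return resolution ** dims


joint_corners = corners_per_lookup(SPATIAL_DIMS + AUDIO_DIMS)
decomposed_corners = corners_per_lookup(SPATIAL_DIMS) + corners_per_lookup(AUDIO_DIMS)

joint_cells = dense_cells(RESOLUTION, SPATIAL_DIMS + AUDIO_DIMS)
decomposed_cells = dense_cells(RESOLUTION, SPATIAL_DIMS) + dense_cells(RESOLUTION, AUDIO_DIMS)

print(f"corners per lookup: joint={joint_corners}, decomposed={decomposed_corners}")
print(f"dense cells: joint={joint_cells}, decomposed={decomposed_cells}")
```

Under these assumptions, each query reads 12 corner values instead of 32, and the dense storage shrinks from $64^5$ to $64^3 + 64^2$ cells. Grid encoders used in practice may store features differently, so the numbers indicate only how the cost scales with dimension.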

Paper Structure

This paper contains 20 sections, 7 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Network Architecture. The head is modeled with the Audio-spatial Decomposed Encoding Module. The input audio signal is first processed by the Audio Feature Extractor (AFE) [guo2021ad], then compressed to a low-dimensional, spatially dependent audio coordinate $\mathbf{x}_a$. Two decomposed grid encoders, $E^3_\text{spatial}$ and $E^2_\text{audio}$, separately encode the spatial coordinate $\mathbf{x}$ and the audio coordinate $\mathbf{x}_a$. The spatial features $\mathbf{f}$ and audio features $\mathbf{g}$ are fused in an MLP to produce the head color $\mathbf{c}$ and density $\sigma$ for volume rendering. The torso is modeled with the Pseudo-3D Deformable Module. Only one torso coordinate $\mathbf{x}_t$ is sampled per pixel, and a deformation field is learned to model the torso dynamics conditioned on the head pose $\mathbf{p}$. Another grid encoder, $E^2_\text{torso}$, learns the torso features $\mathbf{f}_t$, which are fed to an MLP to produce the torso color $\mathbf{c}_t$ and alpha $\alpha_t$.
  • Figure 2: An example of the landmark information. Based on the predicted 2D facial landmarks, we extract three features to assist training: the face region $\mathcal{I}_\text{face}$ for dynamic regularization, the eye ratio $e$ for eye control, and the lips patch $\mathcal{P}$ for lips fine-tuning.
  • Figure 3: Cross-driven quality comparison. We show visualizations of representative methods under the cross-driven setting. Yellow boxes denote low image quality, and red boxes denote inaccurate lips. Our method produces both good image quality and accurate lip movement. We recommend watching the supplementary video for more detail.
  • Figure 4: Self-driven quality comparison. We compare against the ground truth images under the self-driven setting. Our method reconstructs sharper and more accurate lips compared to previous works.
  • Figure 5: Explicit control of talking portrait synthesis. Apart from lips, our method also supports explicit control of eyes, head poses, and background images.
  • ...and 3 more figures
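
The Figure 1 caption describes two outputs per pixel: a head color $\mathbf{c}$ and density $\sigma$ that go through volume rendering, and a single torso color $\mathbf{c}_t$ with alpha $\alpha_t$. The sketch below shows one way these could be turned into a final pixel. It uses the standard NeRF quadrature for the head ray; the layering order (head over torso over background) and both function names are assumptions made for illustration and are not stated in the text above.

```python
import math


def render_head(sigmas, colors, deltas):
    """Standard NeRF quadrature along one ray.

    sigmas: densities at the samples, colors: RGB tuples, deltas: sample spacings.
    Returns the accumulated (premultiplied) RGB and the accumulated opacity.
    """
    rgb = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this sample
        weight = transmittance * alpha           # contribution to the pixel
        rgb = [acc + weight * ch for acc, ch in zip(rgb, color)]
        transmittance *= 1.0 - alpha
    return rgb, 1.0 - transmittance


def composite_pixel(head_rgb, head_opacity, torso_rgb, torso_alpha, background_rgb):
    """Assumed layering: head over torso over background."""
    # Blend the single torso sample with the background image color.
    behind = [torso_alpha * t + (1.0 - torso_alpha) * b
              for t, b in zip(torso_rgb, background_rgb)]
    # head_rgb is already weighted by opacity, so only the remainder shows through.
    return [h + (1.0 - head_opacity) * x for h, x in zip(head_rgb, behind)]


# Usage: a ray with two head samples, one torso sample, and a white background.
head_rgb, head_opacity = render_head(
    sigmas=[0.5, 2.0],
    colors=[(0.8, 0.6, 0.5), (0.7, 0.5, 0.4)],
    deltas=[0.1, 0.1],
)
pixel = composite_pixel(head_rgb, head_opacity, (0.1, 0.1, 0.3), 0.9, (1.0, 1.0, 1.0))
print(pixel)
```

Because the torso needs only one sample per pixel in this formulation, its cost is that of a 2D image lookup rather than a full ray march, which is consistent with the caption's description of the module as lightweight.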