Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition
Jiaxiang Tang, Kaisiyuan Wang, Hang Zhou, Xiaokang Chen, Dongliang He, Tianshu Hu, Jingtuo Liu, Gang Zeng, Jingdong Wang
TL;DR
This paper tackles the slow training and inference of NeRF-based talking portrait synthesis by introducing Real-time Audio-spatial Decomposed NeRF (RAD-NeRF). It represents the dynamic head with two low-dimensional feature grids (a 3D spatial grid and a 2D audio grid) and handles the torso with a lightweight Pseudo-3D Deformable Module built on a further 2D grid, enabling real-time inference while maintaining rendering quality. Key contributions include the Decomposed Audio-spatial Encoding Module, the Pseudo-3D Deformable Module, maximum occupancy grid pruning, and targeted losses for lip and eye control. The method runs at about $40$ FPS and converges substantially faster than prior work, with strong quantitative and qualitative results and explicit control over head pose, eyes, and background. The approach has practical value for telepresence, digital humans, and multimedia applications; the paper also acknowledges ethical considerations and offers insights for deepfake detection and mitigation. In summary, the work advances efficient dynamic NeRF for audio-driven portraits, delivering real-time, high-quality, controllable talking portraits.
Abstract
While dynamic Neural Radiance Fields (NeRF) have shown success in high-fidelity 3D modeling of talking portraits, their slow training and inference speed severely obstruct their potential usage. In this paper, we propose an efficient NeRF-based framework that enables real-time synthesis of talking portraits and faster convergence by leveraging the recent success of grid-based NeRF. Our key insight is to decompose the inherently high-dimensional talking portrait representation into three low-dimensional feature grids. Specifically, a Decomposed Audio-spatial Encoding Module models the dynamic head with a 3D spatial grid and a 2D audio grid. The torso is handled with another 2D grid in a lightweight Pseudo-3D Deformable Module. Both modules focus on efficiency under the premise of good rendering quality. Extensive experiments demonstrate that our method can generate realistic, audio-lip-synchronized talking portrait videos while also being highly efficient compared to previous methods.
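The head decomposition described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes small dense grids with multilinear interpolation in place of whatever grid encoding the paper uses, a random linear projection plus a sigmoid as a stand-in for a learned audio compressor, and hypothetical sizes and names (`SPATIAL_RES`, `AUDIO_RES`, `encode_head`, `interp_grid`) chosen only for illustration. The point it shows is that a 3D position and a 2D audio coordinate are looked up in two separate low-dimensional grids and their features concatenated, rather than indexing one joint high-dimensional grid.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
SPATIAL_RES, AUDIO_RES = 16, 8   # grid resolution per axis
SPATIAL_DIM, AUDIO_DIM = 4, 4    # feature channels stored in each grid
AUDIO_IN = 32                    # size of the raw per-frame audio feature

# Randomly initialised tables; in a trained model these would be learned.
spatial_grid = rng.normal(size=(SPATIAL_RES,) * 3 + (SPATIAL_DIM,))
audio_grid = rng.normal(size=(AUDIO_RES,) * 2 + (AUDIO_DIM,))
# Assumed stand-in for a learned network that compresses audio to 2D coordinates.
audio_proj = rng.normal(size=(AUDIO_IN, 2)) * 0.1


def interp_grid(grid, coords):
    """Multilinear interpolation of a dense grid at coords in [0, 1]^d."""
    d = coords.shape[-1]
    res = grid.shape[0]
    pos = np.clip(coords, 0.0, 1.0) * (res - 1)
    lo = np.minimum(np.floor(pos).astype(int), res - 2)  # lower corner index
    frac = pos - lo                                       # offset inside the cell
    out = np.zeros(coords.shape[:-1] + (grid.shape[-1],))
    for corner in range(2 ** d):
        offs = np.array([(corner >> k) & 1 for k in range(d)])
        idx = lo + offs
        weight = np.prod(np.where(offs == 1, frac, 1.0 - frac), axis=-1)
        out += weight[..., None] * grid[tuple(idx[..., k] for k in range(d))]
    return out


def encode_head(xyz, audio_feat):
    """Concatenate 3D spatial-grid and 2D audio-grid features per sample point."""
    # Squash the projected audio feature into (0, 1)^2 to use it as a grid coordinate.
    audio_xy = 1.0 / (1.0 + np.exp(-(audio_feat @ audio_proj)))
    f_spatial = interp_grid(spatial_grid, xyz)             # (N, SPATIAL_DIM)
    f_audio = interp_grid(audio_grid, audio_xy[None, :])   # (1, AUDIO_DIM)
    # One audio frame is shared by every sample point of that frame.
    f_audio = np.broadcast_to(f_audio, (xyz.shape[0], AUDIO_DIM))
    return np.concatenate([f_spatial, f_audio], axis=-1)


xyz = rng.uniform(size=(5, 3))            # five sample points in the unit cube
audio_feat = rng.normal(size=AUDIO_IN)    # one audio frame
features = encode_head(xyz, audio_feat)
print(features.shape)  # (5, 8)
```

With these illustrative sizes, the two tables hold 16³ + 8² = 4,160 cells in total, whereas a single joint 5D grid at resolution 16 per axis would need 16⁵ = 1,048,576 cells. In a full model the concatenated features would then be passed to small networks that predict density and color.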
