S^3D-NeRF: Single-Shot Speech-Driven Neural Radiance Field for High Fidelity Talking Head Synthesis

Dongze Li, Kang Zhao, Wei Wang, Yifeng Ma, Bo Peng, Yingya Zhang, Jing Dong

TL;DR

S$^{3}$D-NeRF addresses the challenge of high-fidelity, audio-driven talking head synthesis from a single image without identity-specific retraining. It combines a Hierarchical Facial Appearance Encoder to capture multi-scale identity features, a Cross-modal Facial Deformation Field to map speech to region-aware facial motion via cross-attention, and a lip-sync discriminator to enforce tight audio-visual synchronization, all within a coarse-to-fine NeRF rendering framework augmented by a super-resolution module. The approach demonstrates state-of-the-art fidelity and lip synchronization while generalizing to unseen identities, validated on HDTF and MEAD datasets with comprehensive ablations. This work enables robust, view-controllable talking head synthesis from minimal input, with practical impact for digital humans and related applications, while acknowledging ethical considerations and limitations such as background changes and pose-induced artifacts. S$^{3}$D-NeRF thus advances one-shot audio-driven NeRF-based portrait generation by tightly integrating appearance modeling, deformation control, and temporal consistency cues into a unified pipeline.

Abstract

Talking head synthesis is a practical technique with wide applications. Current Neural Radiance Field (NeRF) based approaches have shown their superiority in driving one-shot talking heads with videos or with signals regressed from audio. However, most of them do not use audio directly as the driving signal and therefore cannot benefit from the flexibility and availability of speech. Since mapping audio signals to face deformation is non-trivial, we design a Single-Shot Speech-Driven Neural Radiance Field (S^3D-NeRF) method in this paper to tackle the following three difficulties: learning a representative appearance feature for each identity, modeling the motion of different face regions with audio, and maintaining the temporal consistency of the lip area. To this end, we introduce a Hierarchical Facial Appearance Encoder that learns multi-scale representations to capture the appearance of different speakers, and devise a Cross-modal Facial Deformation Field that performs speech animation according to the relationship between the audio signal and different face regions. Moreover, to enhance the temporal consistency of the important lip area, we introduce a lip-sync discriminator that penalizes out-of-sync audio-visual sequences. Extensive experiments show that S^3D-NeRF surpasses prior art in both video fidelity and audio-lip synchronization.
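
The abstract does not say how the lip-sync discriminator scores a sequence. As a rough illustration only, the sketch below assumes a SyncNet-style formulation (cosine similarity between an audio embedding and a lip-region embedding, turned into a binary cross-entropy penalty). This is a swapped-in, commonly used technique, not necessarily the authors' design. The function names `cosine_similarity` and `sync_loss` and the toy embeddings are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def sync_loss(audio_emb, lip_emb, in_sync=True):
    """Binary cross-entropy on a similarity-derived sync probability.

    Assumption: both embeddings come from pretrained encoders that are
    not shown here; only the penalty itself is sketched.
    """
    # Map cosine similarity from [-1, 1] to a probability, then clamp
    # it away from 0 and 1 so the logarithm stays finite.
    p = (cosine_similarity(audio_emb, lip_emb) + 1.0) / 2.0
    p = min(max(p, 1e-7), 1.0 - 1e-7)
    return -math.log(p) if in_sync else -math.log(1.0 - p)


# Toy embeddings: one lip embedding agrees with the audio, one opposes it.
aligned_loss = sync_loss([1.0, 0.0], [1.0, 0.0])
misaligned_loss = sync_loss([1.0, 0.0], [-1.0, 0.0])
```

Under this assumption, generated frames would be scored with `in_sync=True` during generator training, so lip motion that disagrees with the audio receives a larger penalty than lip motion that agrees with it.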

Paper Structure (17 sections, 9 equations, 6 figures, 3 tables)

This paper contains 17 sections, 9 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: A showcase of our S$^{3}$D-NeRF, which generates high-quality face portraits with fine-grained face texture and mouth details.
  • Figure 2: The full pipeline of our S$^{3}$D-NeRF. The Hierarchical Facial Appearance Encoder extracts representative features from the masked face region of the single-shot source image, for high fidelity neural rendering of an arbitrary identity. The Cross-modal Facial Deformation Field accurately models the motion of different face regions, with the help of the correlation score calculated through cross attention between audio-visual features. Texture details are complemented with the super-resolution module.
  • Figure 3: Results with a naive deformation module (left) and our Cross-modal Facial Deformation Field (right). Lower-face regions have the largest activations in the heatmap, indicating that they correlate most strongly with the driving speech signal.
  • Figure 4: Qualitative comparison with single-shot methods. Our S$^{3}$D-NeRF yields the most accurate lip shapes and the clearest teeth. Note that different methods adopt different face alignment tools, and the ground truth row shows the raw image without alignment, so the face poses from different methods differ slightly.
  • Figure 5: Left (a): Qualitative comparison with NeRF-based methods. When encountering a new identity, our S$^{3}$D-NeRF faithfully synthesizes a portrait without any retraining. Right (b): High-fidelity generation results with multi-view consistency. The ground truth images are front-view. Most of the face details are preserved.
  • ...and 1 more figures
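
The Figure 2 caption mentions a correlation score computed through cross attention between audio-visual features. A minimal sketch of that idea, assuming standard scaled dot-product attention with one audio feature as the query and one feature per face region as the keys, might look as follows. The region names, feature values, and the function `region_correlation` are hypothetical. The paper's actual feature dimensions and attention layout are not given in this summary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_correlation(audio_feat, region_feats):
    """Attention weight of each face region with respect to the audio.

    audio_feat: list of floats, used as the query.
    region_feats: dict mapping a region name to a key vector of the
    same length as audio_feat.
    """
    scale = math.sqrt(len(audio_feat))
    names = list(region_feats)
    # Scaled dot product between the audio query and each region key.
    logits = [
        sum(a * r for a, r in zip(audio_feat, region_feats[name])) / scale
        for name in names
    ]
    return dict(zip(names, softmax(logits)))


# Toy features (made up): the lower-face key is most aligned with the audio.
audio = [1.0, 0.5]
regions = {
    "forehead": [0.0, 0.1],
    "eyes": [0.2, 0.0],
    "lower_face": [1.0, 0.8],
}
scores = region_correlation(audio, regions)
```

With these toy values the lower-face region receives the largest weight, which mirrors the heatmap behavior described in the Figure 3 caption. In a deformation field, such weights could scale how strongly each region responds to the speech signal.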