Audio2Head: Audio-driven One-shot Talking-head Generation with Natural Head Motion

Suzhen Wang, Lincheng Li, Yu Ding, Changjie Fan, Xin Yu

TL;DR

We address the challenge of generating photo-realistic talking-head videos from a single image by decoupling natural head motion from detailed facial expressions. The method introduces a 6-DoF head pose predictor driven by audio via a motion-aware RNN and a dense, keypoint-based motion field to govern full-frame motion, followed by an image renderer to synthesize frames. Key contributions include the motion-field-based representation, a two-stage training scheme for stable motion and high fidelity, and extensive evaluations showing state-of-the-art visual quality and rhythmic head motion across unseen identities. The approach enables robust, controllable one-shot video synthesis with practical applicability while highlighting considerations for ethical use and detection of synthetic media.

Abstract

We propose an audio-driven method that generates photo-realistic talking-head videos from a single reference image. We tackle two key challenges: (i) producing natural head motions that match speech prosody, and (ii) preserving the speaker's appearance under large head motions while keeping non-face regions stable. We first design a head pose predictor that models rigid 6D head movements with a motion-aware recurrent neural network (RNN). The predicted head poses act as the low-frequency, holistic movements of the talking head, allowing the subsequent network to focus on generating detailed facial movements. To depict the motion of the entire image arising from audio, we use a keypoint-based dense motion field representation. We then develop a motion field generator that produces dense motion fields from the input audio, head poses, and reference image. Because this keypoint-based representation models the motions of the facial regions, head, and background jointly, our method can better constrain the spatial and temporal consistency of the generated videos. Finally, an image generation network renders photo-realistic talking-head videos from the estimated keypoint-based motion fields and the input reference image. Extensive experiments demonstrate that our method produces videos with plausible head motions, synchronized facial expressions, and stable backgrounds, and that it outperforms state-of-the-art methods.
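
The abstract does not spell out how sparse keypoints become a dense motion field, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a zeroth-order scheme in which every pixel of the driving frame blends the displacements of the keypoints, plus a static background candidate with zero displacement. The function name `dense_motion_field`, the Gaussian weighting, and the parameters `sigma` and `bg_weight` are hypothetical stand-ins for whatever a learned motion field generator would predict.

```python
import math


def dense_motion_field(ref_kps, drv_kps, height, width, sigma=0.1, bg_weight=1e-3):
    """Blend per-keypoint displacements into a dense backward-warp field.

    ref_kps, drv_kps: lists of (x, y) keypoints in normalized [0, 1] coordinates
    for the reference image and the frame being generated (hypothetical inputs).
    Returns a height x width grid of (dx, dy) offsets: adding an offset to a
    pixel position in the generated frame gives the position to sample from
    in the reference image.
    """
    if len(ref_kps) != len(drv_kps):
        raise ValueError("reference and driving keypoints must be paired")

    field = []
    for row in range(height):
        y = (row + 0.5) / height  # pixel centre, normalized
        field_row = []
        for col in range(width):
            x = (col + 0.5) / width
            # Candidate 0 is a static background: zero displacement.
            weights = [bg_weight]
            offsets = [(0.0, 0.0)]
            for (rx, ry), (kx, ky) in zip(ref_kps, drv_kps):
                dist2 = (x - kx) ** 2 + (y - ky) ** 2
                # Assumption: Gaussian falloff replaces a learned soft mask.
                weights.append(math.exp(-dist2 / (2.0 * sigma ** 2)))
                offsets.append((rx - kx, ry - ky))
            total = sum(weights)
            dx = sum(w * o[0] for w, o in zip(weights, offsets)) / total
            dy = sum(w * o[1] for w, o in zip(weights, offsets)) / total
            field_row.append((dx, dy))
        field.append(field_row)
    return field


if __name__ == "__main__":
    # One keypoint that moved 0.1 to the right between reference and driving frame.
    field = dense_motion_field([(0.5, 0.5)], [(0.6, 0.5)], height=10, width=10)
    print("offset near the keypoint:", field[4][5])
    print("offset in the far corner:", field[0][0])
```

In this toy setting, pixels close to a keypoint inherit that keypoint's displacement, while distant pixels fall back to the static background. This is one simple way a single representation could move the face and head while keeping the background stable, as the abstract describes. A renderer would then warp the reference image (or its features) with this field.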

Paper Structure

This paper contains 22 sections, 6 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Illustration of the proposed audio-driven, single-image-based talking-head video generation method. Top row: the input reference image and audio, and the predicted head pose; middle row: motion fields generated from the audio and image; bottom row: synthesized talking-head frames.
  • Figure 2: Pipeline of the proposed framework.
  • Figure 3: Architecture of the head motion predictor.
  • Figure 4: Architecture of the motion field generator.
  • Figure 5: Comparison with the state-of-the-art. Please see more dynamic demos in our supplementary materials.
  • ...and 5 more figures