PoseTalk: Text-and-Audio-based Pose Control and Motion Refinement for One-Shot Talking Head Generation

Jun Ling, Yiwen Wang, Han Xue, Rong Xie, Li Song

TL;DR

A Pose Latent Diffusion model generates motion latents from text prompts and audio cues in a pose latent space, and two cascaded networks, CoarseNet and RefineNet, synthesize natural talking videos.

Abstract

While previous audio-driven talking head generation (THG) methods generate head poses from driving audio, the generated poses or lips cannot match the audio well or are not editable. In this study, we propose PoseTalk, a THG system that can generate lip-synchronized talking head videos with free head poses conditioned on text prompts and audio. The core insight of our method is using head pose to connect visual, linguistic, and audio signals. First, we propose to generate poses from both audio and text prompts, where the audio offers short-term variations and rhythm correspondence of the head movements and the text prompts describe the long-term semantics of head motions. To achieve this goal, we devise a Pose Latent Diffusion (PLD) model to generate motion latents from text prompts and audio cues in a pose latent space. Second, we observe a loss-imbalance problem: the loss for the lip region contributes less than 4% of the total reconstruction loss caused by both pose and lip, making optimization lean towards head movements rather than lip shapes. To address this issue, we propose a refinement-based learning strategy to synthesize natural talking videos using two cascaded networks, i.e., CoarseNet and RefineNet. The CoarseNet estimates coarse motions to produce animated images in novel poses, and the RefineNet focuses on learning finer lip motions by progressively estimating lip motions from low to high resolutions, yielding improved lip-synchronization performance. Experiments demonstrate that our pose prediction strategy achieves better pose diversity and realness than text-only or audio-only conditioning, and that our video generator outperforms state-of-the-art methods in synthesizing talking videos with natural head motions. Project: https://junleen.github.io/projects/posetalk.
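The loss-imbalance observation in the abstract can be illustrated with a small sketch. The code below measures what fraction of a per-pixel L1 reconstruction loss falls inside a lip region. The function name `lip_loss_share`, the 256x256 frame size, and the 32x64 mouth box are all hypothetical choices made for illustration, and the uniform error is synthetic; this is not the paper's measurement procedure and it does not reproduce the reported figure.

```python
import numpy as np

def lip_loss_share(pred, target, lip_mask):
    """Fraction of the total L1 reconstruction loss that lies inside lip_mask.

    pred, target: float arrays of shape (H, W, C).
    lip_mask: boolean array of shape (H, W), True inside the lip region.
    """
    per_pixel = np.abs(pred - target).sum(axis=-1)  # L1 error per pixel, (H, W)
    total = per_pixel.sum()
    if total == 0:
        return 0.0  # identical images: no loss to attribute
    return float(per_pixel[lip_mask].sum() / total)

rng = np.random.default_rng(0)
H, W = 256, 256                         # assumed frame size
target = rng.random((H, W, 3))
pred = target + 0.1                     # synthetic, spatially uniform error

lip_mask = np.zeros((H, W), dtype=bool)
lip_mask[176:208, 96:160] = True        # hypothetical 32x64 mouth box

# With uniform error the share equals the area fraction: 2048 / 65536 = 0.03125.
share = lip_loss_share(pred, target, lip_mask)
print(f"lip share of L1 loss: {share:.4f}")
```

Because the lip box covers only a few percent of the frame, errors of similar magnitude elsewhere dominate the summed loss, which is the kind of imbalance the abstract describes.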

Paper Structure (23 sections, 10 equations, 15 figures, 5 tables)

Figures (15)

  • Figure 1: Key features of our PoseTalk. Our method can synthesize talking face videos from an image, the driving audio, and driving poses (in the following sections, we use "pose" as the abbreviation of "pose/gaze" for readability). The driving poses might be fixed, reference poses (from other talking videos or predicted from audio), or generated poses based on text prompts and audio. Our approach additionally supports the generation of diverse poses using different text prompts. Due to page constraints, we present these results in our supplementary materials and demonstration videos.
  • Figure 2: Dataset construction pipeline. We adopt off-the-shelf vision models and pretrained audio models to extract motion-related representations and audio features, respectively. Then, for the text prompts that describe head movements from the CelebV-Text dataset, we adopt the text encoder from CLIP (Radford et al., 2021) to obtain semantically aligned text embeddings.
  • Figure 3: The overview of our pose diffusion and talking face video generation. (a) During training, the pose latent diffusion model is conditioned on the pose embedding learned by a VAE. The denoising process is conditioned on the text embedding, time stamps, and audio features. (b) Given a source image, the audio features, and the extracted or predicted pose/gaze features, the video generator gradually estimates finer motions and lip-synced talking videos.
  • Figure 4: Refinement-based Video Generator. (a) We employ a CoarseNet to estimate coarse motions based on the source frame and input conditions. The Motion2Latent module (see (c)) maps the motion $m_{1:T}$ to the latent space. (b) We utilize an image encoder to extract features from the warped frames and develop a motion refinement decoder that progressively outputs finer lip motions from low to high resolutions. (d) The structure of the blocks in the Motion Decoder of RefineNet.
  • Figure 5: Qualitative comparisons with state-of-the-art methods on HDTF and MEAD. We obtain the results under the one-shot audio-driven generation settings.
  • ...and 10 more figures
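
The low-to-high-resolution refinement described in the Figure 4 caption can be sketched generically as coarse-to-fine flow refinement: start from a low-resolution motion field, upsample it, and add a residual at each finer scale. The snippet below is a minimal illustration under stated assumptions, not the paper's RefineNet. The helper names `upsample_flow` and `refine_coarse_to_fine`, the nearest-neighbour upsampling, the 16-to-64 resolution schedule, and the random residuals (stand-ins for what a learned decoder would predict) are all hypothetical.

```python
import numpy as np

def upsample_flow(flow, factor=2):
    """Nearest-neighbour upsample an (H, W, 2) flow field by `factor`.

    Displacements are in pixel units, so they are scaled by the same factor.
    """
    up = np.repeat(np.repeat(flow, factor, axis=0), factor, axis=1)
    return up * factor

def refine_coarse_to_fine(coarse_flow, residuals):
    """Upsample the current flow, then add the residual given for that scale."""
    flow = coarse_flow
    for res in residuals:
        flow = upsample_flow(flow)
        if flow.shape != res.shape:
            raise ValueError(f"residual shape {res.shape} != flow shape {flow.shape}")
        flow = flow + res
    return flow

rng = np.random.default_rng(0)
coarse = rng.normal(size=(16, 16, 2))          # assumed coarse motion at 16x16
# Random residuals at 32x32 and 64x64 stand in for learned per-scale corrections.
residuals = [0.1 * rng.normal(size=(16 * 2**k, 16 * 2**k, 2)) for k in (1, 2)]

fine = refine_coarse_to_fine(coarse, residuals)
print(fine.shape)
```

Keeping the per-scale residuals small relative to the upsampled flow reflects the usual design intent of such schemes: the coarse stage fixes the global motion, and later stages only correct fine detail such as the lip region.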