
DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures

Steven Hogue, Chenxu Zhang, Hamza Daruger, Yapeng Tian, Xiaohu Guo

TL;DR

DiffTED is a new approach for one-shot, audio-driven generation of TED-style talking videos from a single image. It uses classifier-free guidance so that gestures follow the audio input naturally, without relying on pre-trained classifiers.

Abstract

Audio-driven talking video generation has advanced significantly, but existing methods often depend on video-to-video translation techniques and traditional generative networks such as GANs. They also typically generate talking heads and co-speech gestures separately, leading to less coherent outputs. Furthermore, the gestures produced by these methods often appear overly smooth or subdued and lack diversity, and many gesture-centric approaches do not integrate talking head generation. To address these limitations, we introduce DiffTED, a new approach for one-shot audio-driven TED-style talking video generation from a single image. Specifically, we leverage a diffusion model to generate sequences of keypoints for a Thin-Plate Spline motion model, precisely controlling the avatar's animation while ensuring temporally coherent and diverse gestures. The approach uses classifier-free guidance, allowing the gestures to flow naturally with the audio input without relying on pre-trained classifiers. Experiments demonstrate that DiffTED generates temporally coherent talking videos with diverse co-speech gestures.
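The abstract does not spell out how classifier-free guidance is applied, so the following is only a minimal sketch of the standard formulation, not the authors' implementation. It assumes a denoiser that can be called with or without the audio condition and blends the two predictions with a guidance scale. The names `guided_prediction`, `toy_denoiser`, and `guidance_scale` are hypothetical, and the toy denoiser is a stand-in for a trained network.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical signature: (noisy keypoints, timestep, audio features or None) -> prediction
Denoiser = Callable[[Sequence[float], int, Optional[Sequence[float]]], Vector]


def guided_prediction(
    denoiser: Denoiser,
    x_t: Sequence[float],
    t: int,
    audio: Sequence[float],
    guidance_scale: float,
) -> Vector:
    """Blend conditional and unconditional predictions (standard classifier-free guidance)."""
    cond = denoiser(x_t, t, audio)   # prediction with the audio condition
    uncond = denoiser(x_t, t, None)  # prediction with the condition dropped
    # A scale of 0 gives the unconditional output, 1 the conditional output,
    # and values above 1 push the result further toward the audio condition.
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def toy_denoiser(
    x_t: Sequence[float], t: int, audio: Optional[Sequence[float]]
) -> Vector:
    """Stand-in for a trained network, for illustration only."""
    if audio is None:
        return [0.5 * x for x in x_t]
    return [0.5 * x + a for x, a in zip(x_t, audio)]


if __name__ == "__main__":
    print(guided_prediction(toy_denoiser, [2.0, 4.0], 10, [1.0, -1.0], 2.0))
```

Because the guidance signal comes from the same network evaluated twice, no separately trained classifier is needed, which matches the motivation stated in the abstract.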

Paper Structure

This paper contains 13 sections, 7 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Overview of the proposed pipeline, DiffTED. Given a source image and driving audio as input, we use the diffusion model to generate a gesture sequence, $x_0$, represented by TPS keypoints. This sequence of TPS keypoints then serves as input to the video renderer, which transforms the source image to produce the final talking video with co-speech gestures.
  • Figure 2: Qualitative results of the DiffTED pipeline. Five frames are chosen from a sequence to show the diversity of gestures. The wide range of motion can be seen in the arms and body positioning of the speaker, as well as in the direction the speaker is looking. In sequence (a), both hands move, and the face and body turn to look in a different direction. Sequence (b) is the same as (a) but with keypoints added.
  • Figure 3: Failure case of the Speech2Gesture-based network, where the arm, highlighted in blue, grows throughout the sequence in (a). With the diffusion network, the relative arm length stays the same throughout the sequence, as shown in (b).
  • Figure 4: The EAMM-based method suffers from jittering effects in the generated gestures. (a) shows four consecutive frames with a quick jitter in the hand, highlighted in red. The hand moves from its initial position in the first frame to a raised position in the second, back to the initial position in the third, and then lower in the fourth. A smoother and more gradual transition between poses is expected, as seen in sequence (b), which is generated by our diffusion-based method.
  • Figure 5: Qualitative example of the ablation comparing diffusion on position (a)(c) with diffusion on noise (b)(d). In (a), the outstretched arm has an unnatural bend, while in (b) the arm is straight. Image (c) shows another example of an unnatural bend in the arm, whereas in (d) the arm is straight, as expected.
  • ...and 2 more figures