You said that?

Joon Son Chung, Amir Jamaludin, Andrew Zisserman

TL;DR

Speech2Vid generates lip-synced talking-face videos from a single identity image and an audio clip. It uses a two-stream CNN to learn a joint audio-visual embedding from which video frames are generated in real time, and it is trained on tens of hours of unlabelled video. The approach handles identities and speech not seen during training, and a dedicated deblurring module sharpens the output frames. Key findings are the importance of identity-preserving skip connections and the benefit of using multiple identity images. Alignment and Poisson blending additionally enable a practical lip re-dubbing workflow.

Abstract

We present a method for generating a video of a talking face. The method takes as inputs: (i) still images of the target face, and (ii) an audio speech segment; and outputs a video of the target face lip synched with the audio. The method runs in real time and is applicable to faces and audio not seen at training time. To achieve this we propose an encoder-decoder CNN model that uses a joint embedding of the face and audio to generate synthesised talking face video frames. The model is trained on tens of hours of unlabelled videos. We also show results of re-dubbing videos using speech from a different person.

Paper Structure

This paper contains 12 sections, 1 equation, 12 figures, 1 table.

Figures (12)

  • Figure 1: The Speech2Vid model generates a video of a talking face, given still images of the person and a speech segment. The model takes an image of the target face and an audio segment, and outputs a video of the target face lip synched with the audio. Note that the target face need not be in the training dataset, i.e. the Speech2Vid model is applicable to unseen images and speech.
  • Figure 2: Data preparation pipeline.
  • Figure 3: Left pair: Face images before registration; Middle: Canonical face; Right pair: Face images after registration with the canonical face.
  • Figure 4: The overall Speech2Vid model is a combination of two encoders taking in two different streams of data, audio and image, a decoder that generates images corresponding to the audio, and a CNN deblurring module that refines the output frames.
  • Figure 5: Inputs to the Speech2Vid model. Left: MFCC heatmap for the 0.35-second time period. The 12 rows in the matrix represent the power of the audio at different frequencies. Right: Still image of the speaker.
  • ...and 7 more figures
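
The audio input described in the Figure 5 caption can be illustrated with a little arithmetic. The sketch below is only a rough illustration, not the authors' code: it assumes MFCC features sampled at 100 Hz and video at 25 frames per second, neither of which is stated in the captions above, and the function names are hypothetical. Under those assumptions, each 0.35-second window becomes a 12-row heatmap, and one window is taken per output video frame.

```python
# Assumptions (not stated in the captions above): MFCC features are sampled
# at 100 Hz and the output video runs at 25 frames per second.
MFCC_RATE_HZ = 100
VIDEO_FPS = 25
WINDOW_SECONDS = 0.35   # from the Figure 5 caption
NUM_MFCC_ROWS = 12      # from the Figure 5 caption


def window_shape(window_seconds=WINDOW_SECONDS, rate_hz=MFCC_RATE_HZ,
                 rows=NUM_MFCC_ROWS):
    """Return (rows, columns) of the MFCC heatmap for one audio window."""
    columns = round(window_seconds * rate_hz)
    return rows, columns


def window_starts(num_mfcc_steps, rate_hz=MFCC_RATE_HZ, fps=VIDEO_FPS,
                  window_seconds=WINDOW_SECONDS):
    """Return the start index of each audio window, one per video frame."""
    width = round(window_seconds * rate_hz)
    stride = rate_hz // fps  # MFCC steps between consecutive video frames
    return list(range(0, num_mfcc_steps - width + 1, stride))


if __name__ == "__main__":
    print(window_shape())           # heatmap size for one window
    print(len(window_starts(100)))  # windows available in 1 s of audio
```

With these assumed rates, consecutive windows overlap heavily, since the stride between video frames is much shorter than the window itself.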