Memories are One-to-Many Mapping Alleviators in Talking Face Generation

Anni Tang, Tianyu He, Xu Tan, Jun Ling, Li Song

TL;DR

This paper proposes MemFace, which complements the information missing in talking face generation with two memories, one for each stage of a two-stage pipeline: an implicit memory in the audio-to-expression model that captures high-level semantics in the audio-expression shared space, and an explicit memory in the neural-rendering model that helps synthesize pixel-level details.

Abstract

Talking face generation aims to generate photo-realistic video portraits of a target person driven by input audio. Because the mapping from input audio to output video is one-to-many (e.g., one speech content may have multiple feasible visual appearances), learning a deterministic mapping, as previous works do, introduces ambiguity during training and thus leads to inferior visual results. Although this one-to-many mapping can be alleviated in part by a two-stage framework (i.e., an audio-to-expression model followed by a neural-rendering model), this is still insufficient because the prediction is produced without enough information (e.g., emotions, wrinkles). In this paper, we propose MemFace, which complements the missing information with an implicit memory and an explicit memory, one for each of the two stages. More specifically, the implicit memory is employed in the audio-to-expression model to capture high-level semantics in the audio-expression shared space, while the explicit memory is employed in the neural-rendering model to help synthesize pixel-level details. Our experimental results show that MemFace consistently and significantly surpasses all state-of-the-art results across multiple scenarios.
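
The abstract describes the implicit memory only at a high level. As a rough illustration of how a learned key-value memory could complement an audio feature, the sketch below performs an attention-style read. It is not the authors' implementation: the scaled dot-product scoring, the concatenation step, the toy numbers, and every name (`read_memory`, `audio_feature`, `decoder_input`) are assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(query, keys, values):
    """Attention-style read: weight each value by query-key similarity."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    read = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return read, weights


# Toy memory (assumption): in a trained model, keys and values would be
# parameters learned jointly with the audio-to-expression model.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
audio_feature = [2.0, 0.0]  # hypothetical encoder output

read, weights = read_memory(audio_feature, keys, values)

# Assumption: complement the audio feature by concatenating the retrieved
# vector before decoding; other fusion schemes are equally plausible.
decoder_input = audio_feature + read
```

The read is a convex combination of the stored values, so the memory can supply information that the audio feature alone does not carry while remaining differentiable end to end.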

Paper Structure (50 sections, 6 equations, 10 figures, 7 tables, 1 algorithm)

Figures (10)

  • Figure 1: Example frames of real video (1st row), NVP [thies2020neural] (2nd row) and the proposed MemFace (3rd row) for saying 'better day to come'. Compared to the baseline NVP [thies2020neural], our results demonstrate higher lip-sync quality and more realistic rendering results by alleviating the one-to-many mapping problem with memories.
  • Figure 2: The overview of MemFace. To alleviate the one-to-many mapping difficulty, we propose to complement the missing information with memories. The implicit memory is introduced to the audio-to-expression model to complement the semantically-aligned information (see Fig. 3), while the explicit memory is introduced to the neural-rendering model to retrieve the personalized visual details (see Fig. 4).
  • Figure 3: Our implicit memory is introduced between the encoder and decoder of the audio-to-expression model. The keys and values of the memory are both jointly learned with the audio-to-expression model.
  • Figure 4: We construct the explicit memory by the vertices (i.e., the mouth shapes) and corresponding image patches, which can be retrieved by the neural-rendering model to complement the personalized visual details.
  • Figure 5: Qualitative comparison with the state-of-the-art methods (NVP [thies2020neural], MemGAN [yi2020audio], ADNeRF [guo2021ad] and DFRF [shen2022learning]) on the Obama dataset. The blue and green arrows indicate inferior lip-sync and rendering quality respectively. MemFace achieves higher lip-sync and rendering quality.
  • ...and 5 more figures
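
The Figure 4 caption describes an explicit memory that pairs vertices (mouth shapes) with image patches. A minimal sketch of that idea follows, assuming a plain nearest-neighbor lookup in place of whatever retrieval mechanism the paper actually uses. The function names, the flattened vertex layout, and the toy patches are all hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_explicit_memory(vertex_sets, patches):
    """Pair each flattened mouth-vertex vector (key) with its image patch (value)."""
    if len(vertex_sets) != len(patches):
        raise ValueError("need exactly one patch per vertex set")
    return list(zip(vertex_sets, patches))


def retrieve_patch(memory, query_vertices):
    """Return the patch whose stored vertices are closest to the query.

    Assumption: hard nearest-neighbor retrieval, chosen for simplicity.
    """
    _, patch = min(
        memory, key=lambda item: squared_distance(item[0], query_vertices)
    )
    return patch


# Toy data: two vertices with (x, y) coordinates, flattened to length 4;
# patches are 2x2 grayscale grids standing in for mouth-region crops.
memory = build_explicit_memory(
    [[0.0, 0.0, 1.0, 0.0], [0.0, 0.5, 1.0, 0.5]],
    [[[10, 10], [10, 10]], [[200, 200], [200, 200]]],
)

predicted_vertices = [0.0, 0.4, 1.0, 0.45]  # hypothetical model prediction
patch = retrieve_patch(memory, predicted_vertices)
```

Because the keys are geometric and the values are real pixels from the target person, a lookup of this kind can hand the renderer personalized detail that the predicted expression does not encode.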