KAN-Based Fusion of Dual-Domain for Audio-Driven Facial Landmarks Generation

Hoang-Son Vo-Thanh, Quang-Vinh Nguyen, Soo-Hyung Kim

TL;DR

This paper proposes the KFusion of Dual-Domain model, a robust model that generates facial landmarks from audio. It separates the audio into two distinct domains to learn emotional information and facial context, then combines them with a fusion mechanism based on the KAN model.

Abstract

Audio-driven talking face generation is a widely researched topic due to its high applicability. Reconstructing a talking face from audio contributes to fields such as education, healthcare, online conversations, virtual assistants, and virtual reality. Early studies often focused solely on changing the mouth movements, which limited their practical applications. Recently, researchers have proposed constructing the entire face, including face pose, neck, and shoulders. These approaches generate the face through an intermediate landmark representation. However, creating stable landmarks that align well with the audio is a challenge. In this paper, we propose the KFusion of Dual-Domain model, a robust model that generates landmarks from audio. We separate the audio into two distinct domains to learn emotional information and facial context, then use a fusion mechanism based on the KAN (Kolmogorov-Arnold Network) model. Our model demonstrates high efficiency compared to recent models. We expect this work to lay the groundwork for future development of audio-driven talking face generation.

Paper Structure

This paper contains 24 sections, 7 equations, 4 figures, and 2 tables.

Figures (4)

  • Figure 1: Overview of the problem we will address in this paper. We extract landmarks (in blue) from the ground truth video. Our model has learned to generate a sequence of landmarks (in red) from the audio.
  • Figure 2: Overview of our model architecture, which consists of three distinct parts: Global Domain (blue background), Context Domain (yellow background), and KFusion (green background). The outputs of the two domains are features with dimensions $B \times C \times F$ or $B \times C \times M$, where $F$ represents features for the entire face, and $M$ represents features for the mouth region. The output of the entire model is a sequence of landmarks.
  • Figure 3: MEAD: The dataset we use includes video and accompanying audio. "M" denotes male and "W" denotes female. There are a total of 8 emotions in the conversations. We also extract landmarks from the video to serve as ground truth for our problem.
  • Figure 4: Qualitative comparison with other methods on the MEAD dataset for samples M003 and M030. The red landmarks are the target. The last row shows the results of the proposed method. In the third row, the model by Wang et al. [wang2020mead] has red squares indicating inaccuracies in mouth movements and red arrows pointing to incorrect facial orientations. The fourth row, the model by Ji et al. [ji2021audio], shows similar issues.
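
To make the shapes in Figure 2 concrete, the following is a minimal sketch of how two domain outputs of size $B \times C \times F$ and $B \times C \times M$ could be fused by a KAN-style layer. It is not the authors' KFusion implementation: the class `RBFKANLayer`, the function `kan_fuse`, the choice to concatenate along the last axis, the Gaussian basis used in place of B-splines, and all sizes are assumptions made for illustration. The only KAN property it relies on is that each output is a sum of learnable univariate functions of the inputs.

```python
import numpy as np


class RBFKANLayer:
    """Hypothetical KAN-style layer: y_j = sum_i phi_ji(x_i).

    Each univariate function phi_ji is a learnable mix of fixed Gaussian
    bumps (an assumed stand-in for the B-splines of the original KAN).
    """

    def __init__(self, in_dim, out_dim, num_basis=8, x_min=-2.0, x_max=2.0, seed=0):
        rng = np.random.default_rng(seed)
        self.centers = np.linspace(x_min, x_max, num_basis)  # shape (K,)
        self.width = (x_max - x_min) / (num_basis - 1)
        # One coefficient per (output j, input i, basis k).
        self.coef = rng.normal(0.0, 0.1, size=(out_dim, in_dim, num_basis))

    def __call__(self, x):
        # x: (..., in_dim) -> basis activations: (..., in_dim, K)
        basis = np.exp(-(((x[..., None] - self.centers) / self.width) ** 2))
        # Sum over inputs i and basis functions k for every output o.
        return np.einsum("...ik,oik->...o", basis, self.coef)


def kan_fuse(global_feat, context_feat, layer):
    """Concatenate the two domain outputs on the last axis, then apply the layer.

    Concatenation is an assumption; the paper may combine the domains differently.
    """
    fused_in = np.concatenate([global_feat, context_feat], axis=-1)
    return layer(fused_in)


# Illustrative sizes only: batch B, channels C, face features F, mouth features M.
B, C, F, M = 2, 4, 16, 8
rng = np.random.default_rng(1)
global_feat = rng.normal(size=(B, C, F))   # stands in for the Global Domain output
context_feat = rng.normal(size=(B, C, M))  # stands in for the Context Domain output

layer = RBFKANLayer(in_dim=F + M, out_dim=32)
fused = kan_fuse(global_feat, context_feat, layer)
print(fused.shape)  # (2, 4, 32)
```

In a trained model the coefficients would be learned by gradient descent and the fused features would be decoded into a landmark sequence; both steps are omitted here.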