Comparative Analysis of Audio Feature Extraction for Real-Time Talking Portrait Synthesis

Pegah Salehi, Sajad Amouei Sheshkal, Vajira Thambawita, Sushant Gautam, Saeed S. Sabet, Dag Johansen, Michael A. Riegler, Pål Halvorsen

TL;DR

Evaluation of two open-source real-time models across three different datasets shows that Whisper not only accelerates processing but also improves specific aspects of rendering quality, resulting in more realistic and responsive talking-head interactions.

Abstract

This paper examines the integration of real-time talking-head generation for interviewer training, focusing on overcoming challenges in Audio Feature Extraction (AFE), which often introduces latency and limits responsiveness in real-time applications. To address these issues, we propose and implement a fully integrated system that replaces conventional AFE models with OpenAI's Whisper, leveraging its encoder to optimize processing and improve overall system efficiency. Our evaluation of two open-source real-time models across three different datasets shows that Whisper not only accelerates processing but also improves specific aspects of rendering quality, resulting in more realistic and responsive talking-head interactions. These advancements make the system a more effective tool for immersive, interactive training applications, expanding the potential of AI-driven avatars in interviewer training.
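The idea of treating AFE as a swappable, separately timed stage can be illustrated with a minimal sketch. This is not the authors' implementation: the names `dummy_frame_energy`, `time_extractor`, and `compare_backends` are hypothetical, the 16 kHz sample rate and 320-sample frame size are assumptions, and the stand-in extractor only computes a per-frame mean amplitude. A real backend (for example, a wrapper around a Whisper encoder) would be plugged in through the same callable interface.

```python
import time
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: an AFE backend maps raw audio samples
# to a list of per-frame feature vectors.
FeatureExtractor = Callable[[Sequence[float]], List[List[float]]]


def dummy_frame_energy(samples: Sequence[float], frame_size: int = 320) -> List[List[float]]:
    """Stand-in AFE (NOT Whisper): one mean-absolute-amplitude value per frame."""
    frames: List[List[float]] = []
    for start in range(0, len(samples), frame_size):
        chunk = samples[start:start + frame_size]
        frames.append([sum(abs(x) for x in chunk) / len(chunk)])
    return frames


def time_extractor(
    extractor: FeatureExtractor, samples: Sequence[float], repeats: int = 3
) -> Tuple[List[List[float]], float]:
    """Run the extractor `repeats` times; return its features and the best wall-clock time in seconds."""
    best = float("inf")
    features: List[List[float]] = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        features = extractor(samples)
        best = min(best, time.perf_counter() - t0)
    return features, best


def compare_backends(
    backends: Dict[str, FeatureExtractor], samples: Sequence[float]
) -> Dict[str, float]:
    """Time every named backend on the same audio; return {name: seconds}."""
    return {name: time_extractor(fn, samples)[1] for name, fn in backends.items()}


if __name__ == "__main__":
    audio = [0.0] * 16000  # one second of silence at an assumed 16 kHz rate
    print(compare_backends({"dummy": dummy_frame_energy}, audio))
```

Keeping the extractor behind a plain callable means the downstream rendering code would not need to change when one AFE model is exchanged for another, and the timing helper measures only the AFE stage rather than the whole pipeline.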

Paper Structure

This paper contains 22 sections, 2 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: (a) System architecture of the interactive child avatar, detailing the integration of key modules: (1) Listening, (2) STT, (3) Language Processing, (4) TTS, (5) AFE, (6) Frames Rendering, and (7) Audio Overlay. This setup simulates natural conversation, allowing the user to interact with the avatar as if communicating with a real person. (b) User interaction with the child avatar system.
  • Figure 2: Execution time comparison of open-source real-time talking-head generation models, including RAD-NeRF [tang2022real], ER-NeRF [li2023efficient], Gaussian Talker [cho2024gaussiantalker], and GeneFace++ [ye2023geneface++]. The solid lines represent execution times excluding AFE, while the dashed lines indicate execution times that include AFE.
  • Figure 3: Execution time comparison of different AFE models, including Deep-Speech [amodei2016deep], Wav2Vec [baevski2020wav2vec], HuBERT [hsu2021hubert], and Whisper [radford2023robust].
  • Figure 4: Execution time comparison of RAD-NeRF [tang2022real] and ER-NeRF [li2023efficient] across different AFE models.
  • Figure 5: Quality comparison: example visualizations of RAD-NeRF [tang2022real] under the self-driven setting, based on two frames extracted from each video illustrating typical challenges. Yellow boxes highlight areas of noisy image quality, while red boxes indicate regions with inaccurate lip synchronization.
  • ...and 1 more figure