Quality-Aware End-to-End Audio-Visual Neural Speaker Diarization

Mao-Kui He, Jun Du, Shu-Tong Niu, Qing-Feng Liu, Chin-Hui Lee

TL;DR

A quality-aware end-to-end audio-visual neural speaker diarization framework designed to handle overlapping speech and to discriminate accurately between speech and non-speech segments using multi-modal information.

Abstract

In this paper, we propose a quality-aware end-to-end audio-visual neural speaker diarization framework, which comprises three key techniques. First, our audio-visual model takes both audio and visual features as inputs, utilizing a series of binary classification output layers to simultaneously identify the activities of all speakers. This end-to-end framework is meticulously designed to effectively handle situations of overlapping speech, providing accurate discrimination between speech and non-speech segments through the utilization of multi-modal information. Next, we employ a quality-aware audio-visual fusion structure to address signal quality issues for both audio degradations, such as noise, reverberation and other distortions, and video degradations, such as occlusions, off-screen speakers, or unreliable detection. Finally, a cross attention mechanism applied to multi-speaker embedding empowers the network to handle scenarios with varying numbers of speakers. Our experimental results, obtained from various data sets, demonstrate the robustness of our proposed techniques in diverse acoustic environments. Even in scenarios with severely degraded video quality, our system attains performance levels comparable to the best available audio-visual systems.
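
To make the first technique concrete, the following is a minimal sketch of how a bank of per-speaker binary outputs can represent overlapping speech: each speaker gets an independent sigmoid score per frame, so several speakers can be active at once. This is not the authors' implementation; the function names, the 0.5 threshold, and the toy logits are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def decode_activities(logits, threshold=0.5):
    """Turn per-speaker logits into a binary activity matrix.

    logits: array of shape (num_frames, num_speakers), one independent
            binary classifier output per speaker (assumed layout).
    threshold: hypothetical decision threshold on the sigmoid probability.
    """
    probs = sigmoid(np.asarray(logits, dtype=float))
    return (probs >= threshold).astype(int)


def frame_labels(activity):
    """Label each frame by how many speakers are active in it."""
    counts = activity.sum(axis=1)
    return [
        "non-speech" if c == 0 else "single" if c == 1 else "overlap"
        for c in counts
    ]


# Toy example: 4 frames, 2 speakers (values are made up).
logits = np.array([
    [-3.0, -2.0],  # nobody speaking
    [2.5, -1.0],   # speaker 0 only
    [1.5, 2.0],    # both speakers: overlapping speech
    [-0.5, 3.0],   # speaker 1 only
])
activity = decode_activities(logits)
labels = frame_labels(activity)
```

Because each output is decided independently, overlap falls out naturally as a frame with more than one active speaker, and non-speech as a frame with none, which is the property the abstract attributes to the binary classification output layers.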

Paper Structure

This paper contains 22 sections, 10 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Illustration of the network structure.
  • Figure 2: The four fusion strategies for the audio embedding (A), video embedding (V) and speaker embedding (I). IA is the concatenation of the audio and speaker embeddings.
  • Figure 3: Illustration of the cross-speaker attention layer. 'MHA' stands for multi-head attention.
  • Figure 4: An example showcasing issues of track errors and missing facial detections in an audio-visual recording.
  • Figure 5: A DER comparison of different lip miss rates with four fusion strategies on the AMI EVAL set.
  • ...and 1 more figure