Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications

Can Cui, Imran Ahamad Sheikh, Mostafa Sadeghi, Emmanuel Vincent

TL;DR

A novel study aiming to optimize the use of a Speaker-Attributed ASR (SA-ASR) system in real-life scenarios, such as the AMI meeting corpus, for improved speaker assignment of speech segments.

Abstract

Past studies on end-to-end meeting transcription have focused on model architecture and have mostly been evaluated on simulated meeting data. We present a novel study aiming to optimize the use of a Speaker-Attributed ASR (SA-ASR) system in real-life scenarios, such as the AMI meeting corpus, for improved speaker assignment of speech segments. First, we propose a pipeline tailored to real-life applications involving Voice Activity Detection (VAD), Speaker Diarization (SD), and SA-ASR. Second, we advocate using VAD output segments to fine-tune the SA-ASR model, since the model is also applied to VAD segments at test time, and show that this yields a relative Speaker Error Rate (SER) reduction of up to 28%. Finally, we explore strategies to enhance the extraction of the speaker embedding templates used as inputs by the SA-ASR system. We show that extracting them from SD output rather than from annotated speaker segments yields a relative SER reduction of up to 20%.
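
The SER improvements in the abstract are relative reductions, not absolute percentage-point differences. As a minimal sketch of how such a figure is computed, the helper below assumes the usual definition, (baseline - new) / baseline. The function name and the example values are hypothetical and are not results from the paper.

```python
def relative_reduction(baseline_ser: float, new_ser: float) -> float:
    """Relative reduction in percent, given two error rates in percent.

    Assumes the usual definition: (baseline - new) / baseline.
    """
    if baseline_ser <= 0:
        raise ValueError("baseline_ser must be positive")
    return 100.0 * (baseline_ser - new_ser) / baseline_ser


# Hypothetical error rates chosen for illustration only (NOT from the paper).
baseline_ser = 25.0  # SER of a baseline system, in %
improved_ser = 18.0  # SER after a change to the system, in %

absolute_gain = baseline_ser - improved_ser
relative_gain = relative_reduction(baseline_ser, improved_ser)
print(f"absolute: {absolute_gain:.1f} points, relative: {relative_gain:.1f}%")
```

With these made-up values, a 7-point absolute drop corresponds to a 28% relative reduction, which shows why the two kinds of figures should not be compared directly.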

Paper Structure

This paper contains 25 sections, 1 equation, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Diagram of SA-ASR (Kanda et al., 2021).
  • Figure 2: Pipeline of the proposed system.
  • Figure 3: Preparation of the AMI corpus for the training, development, and test sets.
  • Figure 4: Histograms of speaker overlaps on the AMI corpus. Only clips shorter than 10 s are shown.
  • Figure 5: Similarity matrices between the speaker embeddings of the candidate segments in meeting ES2004c. With the three specified segment lengths, the number of candidate segments is 154, 81, and 96, and the resulting SER for that meeting is 19.88%, 19.65%, and 19.62%, respectively.
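
Figure 5 refers to similarity matrices between speaker embeddings of candidate segments. The caption does not name the similarity measure, so the sketch below assumes cosine similarity, a common choice for speaker embeddings. The function names and the toy two-dimensional vectors are hypothetical; real embeddings have far more dimensions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise similarity: entry [i][j] compares segment i with segment j."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]


# Toy embeddings for three candidate segments (hypothetical values).
# The first two point in nearly the same direction, the third does not.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
sim = similarity_matrix(toy_embeddings)
for row in sim:
    print(" ".join(f"{value:.2f}" for value in row))
```

In such a matrix, segments from the same speaker would be expected to form blocks of high similarity, which is what makes it useful for inspecting candidate segments.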