Improving Speaker Assignment in Speaker-Attributed ASR for Real Meeting Applications

Can Cui; Imran Ahamad Sheikh; Mostafa Sadeghi; Emmanuel Vincent

TL;DR

This study optimizes the use of a Speaker-Attributed ASR (SA-ASR) system on real meeting recordings, such as those in the AMI meeting corpus, to improve the speaker assignment of speech segments.

Abstract

Past studies on end-to-end meeting transcription have focused on model architecture and have mostly been evaluated on simulated meeting data. We present a novel study aiming to optimize the use of a Speaker-Attributed ASR (SA-ASR) system on real meeting recordings, such as those in the AMI meeting corpus, for improved speaker assignment of speech segments. First, we propose a pipeline tailored to real-life applications involving Voice Activity Detection (VAD), Speaker Diarization (SD), and SA-ASR. Second, we advocate using VAD output segments to fine-tune the SA-ASR model, since the model is also applied to VAD segments at test time, and show that this yields a relative reduction in Speaker Error Rate (SER) of up to 28%. Finally, we explore strategies to enhance the extraction of the speaker embedding templates used as inputs by the SA-ASR system. We show that extracting them from SD output rather than from annotated speaker segments yields a relative SER reduction of up to 20%.
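As a rough illustration of the ordering described in the abstract (VAD, then SD, then SA-ASR with speaker templates derived from the SD output), the following minimal sketch wires the three stages together. All names here (`transcribe_meeting`, `AttributedWord`, and the callables passed in) are hypothetical placeholders rather than the authors' code or any library's API, and `relative_reduction` simply spells out how a "relative reduction" figure is conventionally computed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Segment = Tuple[float, float]              # (start_s, end_s), hypothetical format
Diarization = Dict[str, List[Segment]]     # speaker label -> segments
Profiles = Dict[str, List[float]]          # speaker label -> embedding template


@dataclass
class AttributedWord:
    word: str
    speaker: str


def transcribe_meeting(
    audio: Sequence[float],
    vad: Callable[[Sequence[float]], List[Segment]],
    diarize: Callable[[Sequence[float], List[Segment]], Diarization],
    extract_profiles: Callable[[Sequence[float], Diarization], Profiles],
    sa_asr: Callable[[Sequence[float], Segment, Profiles], List[AttributedWord]],
) -> List[AttributedWord]:
    """Run the stages in the order given in the abstract (all stages are stubs)."""
    segments = vad(audio)                      # 1. find speech segments
    diarization = diarize(audio, segments)     # 2. who spoke when
    profiles = extract_profiles(audio, diarization)  # 3. templates from SD output
    words: List[AttributedWord] = []
    for segment in segments:                   # 4. SA-ASR on each VAD segment
        words.extend(sa_asr(audio, segment, profiles))
    return words


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error rate: (baseline - improved) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline
```

Passing the stages in as callables keeps the sketch independent of any particular VAD, diarization, or ASR toolkit; only the order of the stages and the reuse of VAD segments at inference are taken from the abstract.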

Paper Structure (25 sections, 1 equation, 5 figures, 5 tables)

Introduction
Background
Voice Activity Detection: CRDNN
Speaker Diarization: ECAPA-TDNN
End-to-end Speaker-Attributed ASR
AMI segmentation
Based on fixed-sized chunks
Based on ground-truth silence positions
Proposed methods
Overall system
Data preparation
Preparation of training, development, and test sets
Remapping of speaker IDs on the test set
Experimental settings
Dataset and metrics
...and 10 more sections

Figures (5)

Figure 1: Diagram of SA-ASR [kanda2021end].
Figure 2: Pipeline of the proposed system.
Figure 3: Preparation of the AMI corpus for the training, development, and test sets.
Figure 4: Histograms of speaker overlaps on the AMI corpus. Only clips shorter than 10 s are shown.
Figure 5: Similarity matrices between the speaker embeddings of the candidate segments in meeting ES2004c. With the three specified segment lengths, the number of candidate segments is 154, 81, and 96, and the resulting SER for that meeting is 19.88%, 19.65%, and 19.62%, respectively.
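The similarity matrices in Figure 5 compare every candidate segment's speaker embedding with every other. A minimal sketch of how such a matrix could be computed is given below, assuming cosine similarity as the measure (the caption does not state which measure is used) and using a hypothetical helper name, `cosine_similarity_matrix`.

```python
import math
from typing import List, Sequence


def cosine_similarity_matrix(
    embeddings: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Pairwise cosine similarity between embeddings (assumed measure)."""
    norms = [math.sqrt(sum(x * x for x in e)) for e in embeddings]
    if any(n == 0.0 for n in norms):
        raise ValueError("zero-norm embedding has no defined cosine similarity")
    # Entry [i][j] is the dot product of embeddings i and j over their norms.
    return [
        [
            sum(a * b for a, b in zip(e_i, e_j)) / (n_i * n_j)
            for e_j, n_j in zip(embeddings, norms)
        ]
        for e_i, n_i in zip(embeddings, norms)
    ]
```

For N candidate segments the result is an N x N symmetric matrix with ones on the diagonal, so blocks of high similarity indicate segments that likely belong to the same speaker.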
