Joint vs Sequential Speaker-Role Detection and Automatic Speech Recognition for Air-traffic Control

Alexander Blatt; Aravind Krishnan; Dietrich Klakow

TL;DR

This study presents a transformer-based joint ASR-SRD system that solves both tasks at once with a standard ASR architecture, and compares it against two cascaded approaches for ASR and SRD on multiple ATC datasets.

Abstract

Using air-traffic control (ATC) data for downstream natural-language processing tasks requires preprocessing steps. Key steps are the transcription of the data via automatic speech recognition (ASR) and speaker diarization or, more specifically, speaker role detection (SRD), which divides the transcripts into pilot and air-traffic controller (ATCO) transcripts. While traditional approaches handle these tasks separately, we propose a transformer-based joint ASR-SRD system that solves both tasks at once while relying on a standard ASR architecture. We compare this joint system against two cascaded approaches for ASR and SRD on multiple ATC datasets. Our study shows in which cases our joint system can outperform the two traditional approaches and in which cases the other architectures are preferable. We additionally evaluate how acoustic and lexical differences between datasets influence all architectures and show how to overcome them for our joint architecture.

Paper Structure (13 sections, 1 equation, 4 figures, 3 tables)

Introduction
Related work
Datasets
ASR&SRD architectures
SRD-ASR
ASR-SRD
Joint
Experimental setup
Results
Inter- and intra-dataset evaluation
Relation and causation analysis for ASR&SRD
Few-shot learning
Conclusion

Figures (4)

Figure 1: Dataset-dependent distributions
Figure 2: ASR&SRD architectures; left: acoustic SRD followed by ASR (SRD-ASR); center: Joint ASR&SRD (Joint); right: ASR followed by linguistic-based SRD (ASR-SRD)
Figure 3: Confusion matrices for different metrics (a)-(l) and different ASR&SRD methods (d)-(l), run with the xlsr model. The columns correspond to the test datasets and the rows to the training datasets. The SNR train/test ratio is calculated from the values in the splits table (tab:splits). The datasets are abbreviated as follows: AT: ATCO2, LD: LDC-ATCC, Li: LiveATC.
Figure 4: Few-shot learning on LDC-ATCC of a Joint-xlsr model previously fine-tuned on LiveATC data. Each experiment was conducted only once.
