Utilizing Multimodal Data for Edge Case Robust Call-sign Recognition and Understanding

Alexander Blatt; Dietrich Klakow

TL;DR

The paper tackles robustness of call-sign recognition and understanding (CRU) in ATC under edge-case conditions such as high word error rate, clipping, and missing transcripts. It introduces CallSBERT, a compact SBERT-based CRU model, and CCR, a multimodal framework that integrates CallSBERT with a coordinate-driven command distribution module to exploit surveillance context from ADS-B data. Through data preparation on MALORCA and AIRBUS with varied edge-case augmentations and CDM optimizations, the study demonstrates up to 15% edge-case performance gains and improved stability across the operational range, with CCR providing resilience when transcripts are degraded or unavailable. The approach offers a scalable, efficient path to robust ATC speech processing and could generalize to other domains where spatial command context is available.

Abstract

Operational machine-learning-based assistant systems must be robust in a wide range of scenarios. This holds especially true for the air-traffic control (ATC) domain. The robustness of an architecture is particularly evident in edge cases, such as high word error rate (WER) transcripts resulting from noisy ATC recordings or partial transcripts due to clipped recordings. To increase the edge-case robustness of call-sign recognition and understanding (CRU), a core task in ATC speech processing, we propose the multimodal call-sign-command recovery model (CCR). The CCR architecture leads to an increase in edge-case performance of up to 15%. We demonstrate this on our second proposed architecture, CallSBERT, a CRU model that has fewer parameters, can be fine-tuned noticeably faster, and is more robust during fine-tuning than the state of the art for CRU. Furthermore, we demonstrate that optimizing for edge cases leads to a significantly higher accuracy across a wide operational range.
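
The edge cases named in the abstract (high-WER transcripts, clipped recordings, missing transcripts) can be imitated at the text level with a few lines of Python. The following is a minimal sketch, not the authors' augmentation pipeline: the function names, the choice to clip leading words, the uniform substitute/delete/insert noise model, and the toy vocabulary are all assumptions made for illustration.

```python
import random

def clip_transcript(words, max_clip, rng):
    """Drop up to `max_clip` leading words (assumed model of a recording that starts late)."""
    n = rng.randint(0, min(max_clip, len(words)))
    return words[n:]

def corrupt_transcript(words, wer, vocab, rng):
    """Hit each word with probability `wer` using a random substitution, deletion, or insertion."""
    out = []
    for w in words:
        if rng.random() < wer:
            op = rng.choice(["sub", "del", "ins"])
            if op == "sub":
                out.append(rng.choice(vocab))        # replace the word
            elif op == "ins":
                out.extend([w, rng.choice(vocab)])   # keep the word, add a spurious one
            # "del": append nothing, the word is lost
        else:
            out.append(w)
    return out

rng = random.Random(0)  # fixed seed so the degradations are reproducible
transcript = "lufthansa three two one descend flight level eight zero".split()
vocab = ["alpha", "two", "climb", "heading", "nine"]  # hypothetical confusion vocabulary

clipped = clip_transcript(transcript, max_clip=3, rng=rng)
noisy = corrupt_transcript(transcript, wer=0.5, vocab=vocab, rng=rng)
missing = []  # missing-transcript edge case: no words at all
```

Note that `wer` here is a per-word corruption probability, so the measured WER of a generated sample only approximates it.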

Paper Structure (15 sections, 7 figures, 1 table)

Introduction
Related work
Data preparation
Models
EncDec
CallSBERT
CCR
CDM optimization
Results
CallSBERT: Surveillance adaptation
Edge cases
High word error rate
Clipping
Missing transcript
Conclusion

Figures (7)

Figure 1: Architecture comparison of the parallel EncDec [Blatt2022] (left) and the sequential CallSBERT model (right).
Figure 2: CCR architecture. The dotted lines mark the additional call-sign prediction path via command distributions.
Figure 3: Maximum accuracy of call-sign prediction based on command distributions with optimal filter parameters.
Figure 4: 2D coordinates of airplanes while receiving a vertical command (a) and 2D distribution maps (top view) of the vertical command in the 200 km $\times$ 200 km Prague airspace (b),(d). Dark-colored areas have a high probability for vertical commands.
Figure 5: Call-sign accuracy depending on the surveillance size per test transcript. During fine-tuning, each transcript has either 4 or 24 corresponding surveillance call-signs.
...and 2 more figures
