Advancing Airport Tower Command Recognition: Integrating Squeeze-and-Excitation and Broadcasted Residual Learning

Yuanxi Lin, Tonglin Zhou, Yang Xiao

TL;DR

The paper tackles aviation command recognition under noisy, resource-constrained conditions by advancing keyword spotting with a lightweight model. It introduces BC-SENet, which combines broadcasted residual learning with Squeeze-and-Excitation (SE) and time-frame frequency-wise SE (tfwSE) attention to emphasize crucial channels and frequencies while keeping the parameter count low. The authors contribute a Chinese Tower Commands dataset and report that, among five keyword spotting models, BC-SENet achieves superior accuracy and efficiency on this dataset and comparable performance on Google Speech Commands, with strong robustness to ambient noise. The work has practical implications for safer, more reliable air traffic communications and points toward parameter-efficient attention mechanisms that could further improve efficiency on edge devices.

Abstract

Accurate recognition of aviation commands is vital for flight safety and efficiency, as pilots must follow air traffic control instructions precisely. This paper addresses challenges in speech command recognition, such as noisy environments and limited computational resources, by advancing keyword spotting technology. We create a dataset of standardized airport tower commands, including routine and emergency instructions. We enhance broadcasted residual learning with squeeze-and-excitation and time-frame frequency-wise squeeze-and-excitation techniques, resulting in our BC-SENet model. This model focuses on crucial information with fewer parameters. Our tests on five keyword spotting models, including BC-SENet, demonstrate its superior accuracy and efficiency. These findings highlight the effectiveness of our model advancements in improving speech command recognition for aviation safety and efficiency in noisy, high-stakes environments. Additionally, BC-SENet shows comparable performance on the common Google Speech Commands dataset.

Paper Structure

This paper contains 17 sections, 5 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Model architecture of the proposed BC-SENet. The yellow module shows the BC-ResBlock, which applies a frequency-depthwise convolution integrated with SubSpectral Normalization (SSN). The resulting feature is averaged across the frequency dimension and processed by a temporal depthwise separable convolution. Finally, the temporal feature is broadcast back into 2D form at the residual connection.
  • Figure 2: Illustration of the SE block (a) and the tfwSE block applied to a single time frame (b). The tfwSE block repeats this procedure for every time frame.
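
The operations named in the two captions can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the feature map is assumed to have shape (channels, frequency, time); the weight matrices `w1` and `w2`, the reduction they imply, and the stand-in callables `f2` and `f1` are hypothetical; the tfwSE squeeze is assumed to average over channels so that each time frame yields one gate per frequency bin; and the residual wiring follows the usual broadcasted-residual form `x + f2(x) + broadcast(f1(avg_freq(f2(x))))`, which may differ in detail from BC-SENet.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def se_block(x, w1, w2):
    """Channel-wise SE on x of shape (C, F, T).

    w1: (C // r, C), w2: (C, C // r) are hypothetical excitation weights.
    """
    z = x.mean(axis=(1, 2))          # squeeze: one value per channel, (C,)
    h = np.maximum(w1 @ z, 0.0)      # bottleneck + ReLU, (C // r,)
    s = sigmoid(w2 @ h)              # channel gates in (0, 1), (C,)
    return x * s[:, None, None]      # rescale every channel


def tfw_se_block(x, w1, w2):
    """tfwSE sketch on x of shape (C, F, T).

    Assumption: the squeeze averages over channels, so each time frame
    gets an F-dimensional vector; the same two-layer excitation is then
    applied to every frame (columns of z) to gate frequency bins.
    w1: (F // r, F), w2: (F, F // r).
    """
    z = x.mean(axis=0)               # (F, T): one frequency vector per frame
    h = np.maximum(w1 @ z, 0.0)      # (F // r, T)
    s = sigmoid(w2 @ h)              # frequency gates per frame, (F, T)
    return x * s[None, :, :]         # same gates shared by all channels


def broadcast_residual(x, f2, f1):
    """Broadcasted residual: 2D op f2, frequency average, 1D temporal op f1.

    f2 and f1 are stand-ins for the frequency-depthwise and temporal
    convolutions; any shape-preserving callables work for this sketch.
    """
    h = f2(x)                                # (C, F, T)
    pooled = h.mean(axis=1, keepdims=True)   # average over frequency, (C, 1, T)
    t = f1(pooled)                           # temporal feature, (C, 1, T)
    return x + h + t                         # t is broadcast back over F
```

Keeping the temporal branch one-dimensional and broadcasting it back over frequency is what keeps the block cheap, while the two gating functions add only two small weight matrices each.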