Noise-Robust Keyword Spotting through Self-supervised Pretraining

Jacob Mørk, Holger Severin Bovbjerg, Gergely Kiss, Zheng-Hua Tan

TL;DR

This paper investigates how self-supervised learning (SSL) pretraining such as Data2Vec can improve the robustness of keyword spotting (KWS) models in noisy conditions, a topic that remains under-explored. It finds that pretraining on noisy data, especially with the Data2Vec-denoising approach, significantly enhances the robustness of KWS models in noisy conditions.

Abstract

Voice assistants are now widely available, and to activate them a keyword spotting (KWS) algorithm is used. Modern KWS systems are mainly trained using supervised learning methods and require a large amount of labelled data to achieve good performance. Leveraging unlabelled data through self-supervised learning (SSL) has been shown to increase accuracy in clean conditions. This paper explores how SSL pretraining such as Data2Vec can be used to enhance the robustness of KWS models in noisy conditions, which is under-explored. Models of three different sizes are pretrained using different pretraining approaches and then fine-tuned for KWS. These models are then tested and compared to models trained using two baseline supervised learning methods, one being standard training using clean data and the other being multi-style training (MTR). The results show that pretraining and fine-tuning on clean data is superior to supervised learning on clean data across all testing conditions, and superior to supervised MTR for testing conditions of SNR above 5 dB. This indicates that pretraining alone can increase the model's robustness. Finally, it is found that using noisy data for pretraining models, especially with the Data2Vec-denoising approach, significantly enhances the robustness of KWS models in noisy conditions.
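
The abstract characterises test conditions by SNR and compares against multi-style training, which relies on noise-corrupted speech. The summary does not describe how the noisy data were produced, so the following is only a minimal sketch of one common way to mix a noise signal into speech at a target SNR. The function names `mix_at_snr` and `measured_snr_db`, the synthetic sine and Gaussian signals, and the 16 kHz length are all illustrative assumptions, not the authors' pipeline.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus scaled noise so the mixture has the requested SNR (dB).

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both have non-zero power")
    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)); solve for gain.
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """Recover the SNR of a mixture, given the clean speech it was built from."""
    residual = [m - s for s, m in zip(speech, mixture)]
    speech_power = sum(s * s for s in speech) / len(speech)
    residual_power = sum(r * r for r in residual) / len(residual)
    return 10 * math.log10(speech_power / residual_power)


rng = random.Random(0)
# Synthetic stand-ins (assumptions): a 440 Hz tone for the utterance, white noise for the background.
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=5.0)
```

Calling `measured_snr_db(speech, noisy)` on the result should return a value very close to the requested 5 dB, which is a quick way to check that the gain was computed correctly.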

Paper Structure (13 sections, 1 equation, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Illustration of the KWS system.
  • Figure 2: Illustration of the various Data2Vec pretraining setups. Here, the Data2Vec-clean input signal follows the yellow path, Data2Vec-noisy follows the blue, and Data2Vec-denoising follows the green. Black arrows denote signal paths common for all three configurations.
  • Figure 3: Visualization of the results for the KWT-1 models, tested on data with seen noise types. The results from the KWT-2 and KWT-3 models follow a similar pattern.
  • Figure 4: Visualization of the results for the KWT-1 models, tested on data with unseen noise types. The results from the KWT-2 and KWT-3 models follow a similar pattern.