Consistency Based Unsupervised Self-training For ASR Personalisation

Jisi Zhang, Vandana Rajan, Haaris Mehmood, David Tuckey, Pablo Peso Parada, Md Asif Jalal, Karthikeyan Saravanan, Gil Ho Lee, Jungin Lee, Seokyeong Jung

TL;DR

This work tackles domain shift in on-device ASR by introducing a consistency-based unsupervised self-training framework for personalisation. A consistency constraint (CC) is applied to pseudo-labelled data, with perturbations to both the inputs (SpecAugment) and the model (dropout). The first-pass ASR component is updated using the CC loss $L=-\ln \Pr(\hat{y}|\tilde{x})$, where $\tilde{x}$ is the perturbed input and $\hat{y}$ is the pseudo-label provided by the second pass. The method is combined with data filtering via a Neural Confidence Measure (NCM) and evaluated on 12 speaker scenarios, achieving relative WER reductions of $17.3\%$, $7.2\%$, and $8.1\%$ on Apps, Contacts, and Dictation respectively, surpassing the entropy minimisation and LHUC baselines and setting a new state of the art. CC is also shown to be robust across different data-filtering strategies and suitable for on-device deployment, suggesting practical impact for personalised ASR without labelled user data.
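The CC loss described above can be illustrated with a small sketch. It is not the authors' implementation: the paper updates a transducer first pass using pseudo-labels from the second pass, whereas this toy uses a hypothetical per-frame linear classifier (`toy_model`) and takes pseudo-labels from its own unperturbed output. The helper names `time_mask`, `dropout`, and `consistency_loss`, and all numeric values, are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def time_mask(features, start, width):
    """SpecAugment-style time masking: zero `width` frames beginning at `start`."""
    return [
        [0.0] * len(frame) if start <= t < start + width else list(frame)
        for t, frame in enumerate(features)
    ]


def dropout(vector, rate, rng):
    """Inverted dropout: drop each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vector]


def toy_model(features, weights, rate=0.0, rng=None):
    """Hypothetical per-frame linear classifier returning class probabilities."""
    outputs = []
    for frame in features:
        if rate > 0.0:
            frame = dropout(frame, rate, rng)  # model perturbation
        logits = [sum(w * x for w, x in zip(row, frame)) for row in weights]
        outputs.append(softmax(logits))
    return outputs


def consistency_loss(probs, pseudo_labels):
    """L = -ln Pr(y_hat | x_tilde), factorised per frame in this toy setting."""
    return -sum(math.log(p[y]) for p, y in zip(probs, pseudo_labels))


rng = random.Random(0)
features = [[1.0, 0.2], [0.1, 0.9], [0.8, 0.3]]  # 3 frames, 2 feature bins
weights = [[2.0, -1.0], [-1.0, 2.0]]              # 2 output classes

# Pseudo-labels come from an unperturbed pass (the second pass in the paper).
clean = toy_model(features, weights)
pseudo_labels = [max(range(len(p)), key=p.__getitem__) for p in clean]

# The loss is evaluated with input (masking) and model (dropout) perturbations.
perturbed = toy_model(time_mask(features, 1, 1), weights, rate=0.1, rng=rng)
loss = consistency_loss(perturbed, pseudo_labels)
```

In this sketch the masked frame carries no information, so the model is maximally uncertain there and that frame contributes $\ln 2$ to the loss. Minimising the loss pushes the perturbed prediction back towards the pseudo-label.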

Abstract

On-device Automatic Speech Recognition (ASR) models trained on speech data of a large population might underperform for individuals unseen during training. This is due to a domain shift between user data and the original training data, which differ in the user's speaking characteristics and environmental acoustic conditions. ASR personalisation is a solution that aims to exploit user data to improve model robustness. The majority of ASR personalisation methods assume labelled user data for supervision. Personalisation without any labelled data is challenging due to the limited data size and poor quality of recorded audio samples. This work addresses unsupervised personalisation by developing a novel consistency-based training method via pseudo-labelling. Our method achieves a relative Word Error Rate Reduction (WERR) of 17.3% on unlabelled training data and 8.1% on held-out data compared to a pre-trained model, and outperforms the current state-of-the-art methods.
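For readers unfamiliar with the metric, a relative WERR is conventionally the drop in WER expressed as a fraction of the baseline WER. The sketch below assumes that standard definition, which the text here does not spell out. The function name `relative_werr` and the WER values are hypothetical and are not results from the paper.

```python
def relative_werr(wer_baseline, wer_adapted):
    """Relative WER reduction in percent: 100 * (baseline - adapted) / baseline."""
    if wer_baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (wer_baseline - wer_adapted) / wer_baseline


# Hypothetical numbers: a baseline WER of 20.0% reduced to 16.54%
# corresponds to a relative reduction of about 17.3%.
example_werr = relative_werr(20.0, 16.54)
```

A positive value means the adapted model improves on the pre-trained one, and a negative value means it degrades.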

Paper Structure (14 sections, 1 equation, 4 figures, 3 tables, 1 algorithm)

Figures (4)

  • Figure 1: Streaming two-pass end-to-end ASR model architecture. The first pass model is a conformer based transducer. The second pass model is an attention-based encoder-decoder model (LAS). NCM classifier is a confidence estimation module that uses intermediate ASR features for WER based data filtering.
  • Figure 2: Unsupervised personalisation pipeline based on data filtering and consistency constraint.
  • Figure 3: Word Error Rate Reduction (WERR) compared to the pre-trained model for Apps, Contacts & Dictation using consistency training (CC) and unsupervised NST for 20 rounds with a choice of 1, 3 or 5 epochs per round. Higher values are better. Plot values are smoothed using an exponential moving average with weight of 0.6. Best viewed in colour.
  • Figure 4: ASR personalisation results for each of the 12 individual users. The pre-trained model, NST trained on unfiltered data, and the proposed method are compared in the plot. (Top: Apps data, Middle: Contacts data, Bottom: Dictation data.)
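
The smoothing mentioned in the Figure 3 caption can be sketched as follows. The caption gives only the weight (0.6), so the convention used here is an assumption: the weight is applied to the previous smoothed value and the first point is left unchanged. The function name `ema_smooth` and the sample values are hypothetical.

```python
def ema_smooth(values, weight=0.6):
    """Exponential moving average: s[t] = weight * s[t-1] + (1 - weight) * x[t]."""
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))  # first point is kept as-is
        else:
            smoothed.append(weight * smoothed[-1] + (1.0 - weight) * x)
    return smoothed


# Hypothetical per-round WERR values, not taken from the paper.
raw_werr = [1.0, 2.0, 2.0]
smooth_werr = ema_smooth(raw_werr, weight=0.6)
```

Under the opposite convention (weight on the newest value), the curve would follow the raw points more closely.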