Adapting Whisper for Regional Dialects: Enhancing Public Services for Vulnerable Populations in the United Kingdom

Melissa Torgbi, Andrew Clayman, Jordan J. Speight, Harish Tayyar Madabushi

TL;DR

The study investigates how regional UK dialects affect automatic speech recognition with Whisper large-v3, focusing on two Scottish accents and real-world public-service calls. It introduces the novel NESAC and SESHA datasets, evaluates out-of-the-box Whisper performance, and shows that fine-tuning on region-specific data can improve transcription for the targeted accent, with some transferability to nearby dialects. A manual error analysis reveals that WER can misrepresent actual transcription quality because of differences in transcription style and context, underscoring the need for qualitative evaluation alongside quantitative metrics. The work highlights the potential and limitations of fine-tuning ASR models for public services that serve vulnerable populations, and emphasizes a careful balance between accent adaptation and maintaining contextual understanding.

Abstract

We collect novel data in the public service domain to evaluate the capability of state-of-the-art automatic speech recognition (ASR) models in capturing regional differences in accents in the United Kingdom (UK), focusing specifically on two accents from Scotland with distinct dialects. This study addresses real-world problems where biased ASR models can lead to miscommunication in public services, disadvantaging individuals with regional accents, particularly those in vulnerable populations. We first examine the out-of-the-box performance of the Whisper large-v3 model on a baseline dataset and on our data. We then explore the impact of fine-tuning Whisper on performance in the two UK regions, and investigate the effectiveness of existing model evaluation techniques for our real-world application through manual inspection of model errors. We observe that the Whisper model has a higher word error rate (WER) on our test datasets than on the baseline data, and that fine-tuning on a given dataset improves performance on the test dataset with the same domain and accent. The fine-tuned models also appear to show improved performance when applied to test data from outside the region they were trained on, suggesting that fine-tuned models may be transferable within parts of the UK. Our manual analysis of model outputs reveals the benefits and drawbacks of using WER as an evaluation metric and of fine-tuning to adapt to regional dialects.
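
Because WER is central to the evaluation described above, a minimal sketch of how it is commonly computed may help: WER = (substitutions + deletions + insertions) / number of reference words, obtained from a word-level edit distance. The function names `word_error_rate` and `normalise` are hypothetical, and the normalisation shown (lowercasing and stripping punctuation) is an assumption for illustration, not the procedure used in the paper.

```python
import re


def normalise(text: str) -> str:
    """Illustrative normalisation (an assumption): lowercase, drop punctuation except apostrophes."""
    return re.sub(r"[^\w\s']", "", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Same spoken content, different transcription style.
raw = word_error_rate("Aye, I ken.", "aye i ken")
cleaned = word_error_rate(normalise("Aye, I ken."), normalise("aye i ken"))
contraction = word_error_rate("I do not know", "I don't know")
```

In this sketch `raw` is 1.0 while `cleaned` is 0.0, and the contraction example scores 0.5 even though the meaning is unchanged. This is one way differences in transcription style alone can inflate WER, which is consistent with the abstract's point that manual inspection is needed alongside the metric.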

Paper Structure (20 sections, 3 figures, 8 tables)

This paper contains 20 sections, 3 figures, 8 tables.

Figures (3)

  • Figure 1: Word error rate (WER) of the Whisper large-v3 model on the baseline dataset and on the two test datasets (NESAC and SESHA).
  • Figure 2: WER of the Whisper large-v3 model, the NESAC fine-tuned model, and the SESHA fine-tuned model on the baseline dataset and on the two test datasets (NESAC and SESHA).
  • Figure 3: Difference in average word error rate (WER) from Whisper large-v3 after cumulative automated optimisation steps.