You don't understand me!: Comparing ASR results for L1 and L2 speakers of Swedish
Ronald Cumbal, Birger Moell, Jose Lopes, Olof Engwall
TL;DR
This study systematically compares three ASR services (Google Cloud, Microsoft Azure, and Huggingface/Wav2Vec2) on Swedish L1 and L2 speech, both read and spontaneous, using two datasets (Ville and CORALL). It quantifies recognition performance with Word Error Rate ($WER$) and introduces a pair of error metrics, $ef_{w}$ and $enf_{w}$, to examine which words are most often misrecognized and how word frequency shapes errors. The results show a clear native advantage in $WER$ that is strongest in read speech; the gap narrows for spontaneous speech and varies by service (notably, Microsoft Azure shows a native advantage in conversational contexts). Short utterances frequently go unrecognized, and certain words that learners rely on to ask for help (e.g., "förstår", "understand", and "repetera", "repeat") are particularly problematic. This underscores the challenges for educational applications and the need for targeted fine-tuning and domain adaptation. The work thus informs the practical deployment of ASR in language learning tools and outlines directions for improving robustness to non-native Swedish in realistic interactions.
Abstract
The performance of state-of-the-art Automatic Speech Recognition (ASR) systems has improved steadily. However, performance tends to decrease considerably in more challenging conditions (e.g., background noise, multi-speaker social conversations) and with more atypical speakers (e.g., children, non-native speakers, or people with speech disorders). General improvements therefore do not necessarily transfer to applications that rely on ASR, such as educational software for younger students or language learners. In this study, we focus on the performance gap between recognition results for native and non-native Swedish utterances, both read and spontaneous, transcribed by different ASR services. We compare the recognition results using Word Error Rate and analyze the linguistic factors that may give rise to the observed transcription errors.
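For readers unfamiliar with the metric, Word Error Rate is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the ASR hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function name `word_error_rate`, the lowercase-and-split normalization, and the example sentences are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is normalized only by lowercasing and whitespace
    splitting; a real evaluation would likely also handle punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical utterances, not taken from the Ville or CORALL data.
one_substitution = word_error_rate("jag förstår inte", "jag första inte")
nothing_recognized = word_error_rate("kan du repetera", "")
```

Under this definition, an utterance for which the recognizer returns nothing counts every reference word as a deletion, so its WER is 1.0. Because the denominator is the reference length, a single error in a short utterance also weighs heavily, and WER can exceed 1.0 when the hypothesis contains many insertions.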
