Does Whisper understand Swiss German? An automatic, qualitative, and human evaluation

Eyal Liron Dolev, Clemens Fidel Lutz, Noëmi Aepli

TL;DR

This paper evaluates Whisper's zero-shot ability to handle Swiss German by transcribing to Standard German across three large corpora and a newly created Mock Clinical Interviews dataset. It employs automatic metrics (WER, BLEU), qualitative error analysis, and a human survey (n=28) to provide a comprehensive assessment. Whisper large-v3 demonstrates competitive automatic performance and high human satisfaction, with continuous recordings outperforming short clips and generally faithful translations, albeit with occasional tense shifts, lexical adjustments, and rare hallucinations. The findings support using Whisper out-of-the-box for Swiss German transcription into Standard German in many applied settings, while cautioning users to verify audio and be mindful of potential errors for critical tasks.

Abstract

Whisper is a state-of-the-art automatic speech recognition (ASR) model (Radford et al., 2022). Although Swiss German dialects are allegedly not part of Whisper's training data, preliminary experiments showed that Whisper can transcribe Swiss German quite well, with the output being a speech translation into Standard German. To gain a better understanding of Whisper's performance on Swiss German, we systematically evaluate it using automatic, qualitative, and human evaluation. We test its performance on three existing test sets: SwissDial (Dogan-Schönberger et al., 2021), STT4SG-350 (Plüss et al., 2023), and Swiss Parliaments Corpus (Plüss et al., 2021). In addition, we create a new test set for this work, based on short mock clinical interviews. For automatic evaluation, we used word error rate (WER) and BLEU. In the qualitative analysis, we discuss Whisper's strengths and weaknesses and analyze some output examples. For the human evaluation, we conducted a survey with 28 participants who were asked to evaluate Whisper's performance. All of our evaluations suggest that Whisper is a viable ASR system for Swiss German, as long as Standard German output is desired.
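To make the WER metric named in the abstract concrete, the following is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenization, and the example sentences are assumptions made for illustration; the paper's actual tooling and text normalization are not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: tokenization is a plain whitespace split,
    with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a tense shift changes one word and drops another.
reference = "ich bin gestern nach hause gegangen"
hypothesis = "ich ging gestern nach hause"
print(f"WER = {word_error_rate(reference, hypothesis):.3f}")
```

Because WER counts every word-level difference, a meaning-preserving rephrasing such as a tense shift is penalized just like a recognition error. This is one reason a surface-overlap metric like WER is commonly complemented with BLEU and human judgments when the output is a translation rather than a verbatim transcript.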

Paper Structure (27 sections, 3 figures, 10 tables)

Figures (3)

  • Figure 1: The first three dialectal translations of the first entry in the SwissDial corpus. The first word in the Standard German source ("de"), derzeit, is translated differently in each dialect: zur ziit, momentan, derziit.
  • Figure 2: Distribution of WER scores for each corpus.
  • Figure 3: Distribution of BLEU scores for each corpus.