LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech

Haechan Kim, Junho Myung, Seoyoung Kim, Sungpah Lee, Dongyeop Kang, Juho Kim

TL;DR

Linguistic analysis shows that the transcriptions in this dataset contain significantly more L2S (L2 learner's Spontaneous speech) features, namely ungrammatical expressions and disfluencies, than native speech datasets.

Abstract

Prevalent ungrammatical expressions and disfluencies in spontaneous speech from second language (L2) learners pose unique challenges to Automatic Speech Recognition (ASR) systems. However, few datasets are tailored to L2 learner speech. We publicly release LearnerVoice, a dataset consisting of 50.04 hours of audio and transcriptions of L2 learners' spontaneous speech. Our linguistic analysis reveals that transcriptions in our dataset contain significantly more L2S (L2 learner's Spontaneous speech) features, consisting of ungrammatical expressions and disfluencies (e.g., filler words, word repetitions, self-repairs, false starts), than native speech datasets. Fine-tuning whisper-small.en with LearnerVoice achieves a WER of 10.26%, 44.2% lower than vanilla whisper-small.en. Furthermore, our qualitative analysis indicates that 54.2% of errors from the vanilla model on LearnerVoice are attributable to L2S features, with 48.1% of them being reduced in the fine-tuned model.
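As an illustration of the WER figures quoted in the abstract, the following is a minimal sketch of how word error rate is conventionally computed (word-level edit distance divided by reference length) and of how a relative reduction relates two WER values. It is not the authors' evaluation code. The function names and the example sentences are hypothetical, and reading "44.2% lower" as a relative reduction is an assumption, not something the abstract states.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


def implied_baseline(improved: float, reduction: float) -> float:
    """Baseline value implied by an improved value and a relative reduction."""
    return improved / (1.0 - reduction)


# Hypothetical example: an ASR system that drops a filler word and a repetition.
reference = "i um i go to school"
hypothesis = "i go to school"
print(f"WER = {word_error_rate(reference, hypothesis):.3f}")

# ASSUMPTION: the 44.2% in the abstract is a relative reduction.
print(f"Implied vanilla WER = {implied_baseline(10.26, 0.442):.2f}%")
```

Under that assumption, a fine-tuned WER of 10.26% corresponds to a vanilla WER of roughly 18.4%. The example pair also shows why disfluencies matter for this metric: a model that silently "cleans up" a filler word or a repetition is charged one deletion per dropped word when the reference transcript keeps them.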

Paper Structure (17 sections, 4 figures, 3 tables)

This paper contains 17 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Examples of L2S features from the transcriptions of LearnerVoice, where filler words (FW), self-repairs (SR), and ungrammatical expressions (UE) are well represented. Word repetitions, self-repairs, fragments, and false starts are all treated as types of self-repair and labeled SR.
  • Figure 2: Distribution of filler words per token (FW), self-repairs per token (SC), and grammatical errors per c-unit (GE) by dataset. FW and SC use the left-hand y-axis; GE uses the right-hand y-axis. The plot shows a statistically significant difference in the abundance of L2S features in LearnerVoice compared to Switchboard and LibriSpeech.
  • Figure 3: The distribution of error types for the vanilla whisper-small.en model indicates that ASR errors are predominantly influenced by error types related to L2S features. The error tags are shown in abbreviated form, as defined in Table \ref{tab:error_taxonomy_definition}.
  • Figure 4: Change ratio of error counts by error type between the vanilla whisper-small.en and whisper-small.en fine-tuned on LearnerVoice. The reduction in errors is much greater for error types related to L2S features than for unrelated error types.
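
The per-token rates named in the Figure 2 caption can be illustrated with a minimal sketch. This is an assumption-laden approximation, not the authors' annotation procedure: the filler list `FILLERS`, the regex tokenizer, and the use of immediate word repetition as a stand-in for one kind of self-repair are all hypothetical choices made for illustration.

```python
import re

# ASSUMPTION: a small illustrative filler inventory, not the one used in the paper.
FILLERS = {"um", "uh", "er", "hmm"}


def tokenize(transcript: str) -> list[str]:
    """Lowercase the transcript and keep alphabetic tokens (apostrophes allowed)."""
    return re.findall(r"[a-z']+", transcript.lower())


def filler_rate(transcript: str) -> float:
    """Filler words per token."""
    tokens = tokenize(transcript)
    if not tokens:
        return 0.0
    return sum(token in FILLERS for token in tokens) / len(tokens)


def repetition_rate(transcript: str) -> float:
    """Immediate word repetitions per token (a crude proxy for one self-repair type)."""
    tokens = tokenize(transcript)
    if not tokens:
        return 0.0
    repeats = sum(1 for prev, curr in zip(tokens, tokens[1:]) if prev == curr)
    return repeats / len(tokens)


# Hypothetical learner utterance.
utterance = "Um, I uh go to school"
print(f"fillers per token: {filler_rate(utterance):.3f}")
print(f"repetitions per token: {repetition_rate('I I go go to')}")
```

Normalizing by token count makes the rates comparable across datasets of different sizes, which is what a cross-dataset comparison such as Figure 2 requires.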