LearnerVoice: A Dataset of Non-Native English Learners' Spontaneous Speech

Haechan Kim; Junho Myung; Seoyoung Kim; Sungpah Lee; Dongyeop Kang; Juho Kim

TL;DR

Linguistic analysis shows that the transcriptions in this dataset contain significantly more L2S (L2 learner's Spontaneous speech) features, namely ungrammatical expressions and disfluencies, than native speech datasets.

Abstract

Prevalent ungrammatical expressions and disfluencies in spontaneous speech from second language (L2) learners pose unique challenges to Automatic Speech Recognition (ASR) systems. However, few datasets are tailored to L2 learner speech. We publicly release LearnerVoice, a dataset consisting of 50.04 hours of audio and transcriptions of L2 learners' spontaneous speech. Our linguistic analysis reveals that transcriptions in our dataset contain significantly more L2S (L2 learner's Spontaneous speech) features, namely ungrammatical expressions and disfluencies (e.g., filler words, word repetitions, self-repairs, false starts), than native speech datasets. Fine-tuning whisper-small.en with LearnerVoice achieves a WER of 10.26%, 44.2% lower than that of vanilla whisper-small.en. Furthermore, our qualitative analysis indicates that 54.2% of the vanilla model's errors on LearnerVoice are attributable to L2S features, and that 48.1% of those errors are eliminated by the fine-tuned model.
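To make the WER figures in the abstract concrete, the following is a minimal sketch of how word error rate is conventionally computed (word-level edit distance divided by reference length) and how a relative reduction is derived. It is not the authors' evaluation code: the function names, the example sentences, and the reading of "44.2% lower" as a relative (not absolute) reduction are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical example (not from the dataset): an ASR system "corrects"
# a learner's wording, producing one substitution and one deletion.
example_wer = word_error_rate(
    "i went to the store yesterday",
    "i go to store yesterday",
)

# ASSUMPTION: if the reported 44.2% is a relative reduction, the
# baseline WER implied by the fine-tuned WER of 10.26% would be:
implied_baseline = 10.26 / (1 - 0.442)
```

Under that assumption, `implied_baseline` comes out at roughly 18.4%; if the paper instead meant an absolute difference, this back-calculation would not apply.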

Paper Structure (17 sections, 4 figures, 3 tables)

Introduction
LearnerVoice Dataset
Dataset Overview
Dataset Construction
L2 Learners Distribution
L2S Features Distribution
Quantifying L2S Features in the Dataset
Comparison with Other Datasets
Fine-tuning with LearnerVoice
Dataset Used for Fine-Tuning
Experiment Setting
Experiment Result
ASR Error Tagging on L2 Speech
ASR Error Taxonomy
Methods
...and 2 more sections

Figures (4)

Figure 1: Examples of L2S features from the transcriptions of LearnerVoice, where filler words (FW), self-repairs (SR), and ungrammatical expressions (UE) are well represented. Word repetitions, self-repairs, fragments, and false starts are all treated as types of self-repair and are labeled SR.
Figure 2: Distribution of filler words per token (FW), self-repairs per token (SC), and grammatical errors per c-unit (GE) by dataset. FW and SC use the left-hand y-axis; GE uses the right-hand y-axis. The difference in the abundance of L2S features between LearnerVoice and both Switchboard and LibriSpeech is statistically significant.
Figure 3: The distribution of error types for the vanilla whisper-small.en model indicates that ASR errors are predominantly of types related to L2S features. The error tags are abbreviated as defined in the paper's error taxonomy table.
Figure 4: Change ratio of error counts by error type between the vanilla whisper-small.en and whisper-small.en fine-tuned on LearnerVoice. Errors of types related to L2S features are reduced far more than errors of unrelated types.
