Learnings from curating a trustworthy, well-annotated, and useful dataset of disordered English speech

Pan-Pan Jiang; Jimmy Tobin; Katrin Tomanek; Robert L. MacDonald; Katie Seaver; Richard Cave; Marilyn Ladewig; Rus Heywood; Jordan R. Green

TL;DR

This report describes Project Euphonia's latest advances in data collection and annotation: expanding speaker diversity in the database, adding human-reviewed transcript corrections and audio quality tags to 350K audio recordings, and gathering comprehensive metadata for over 75% of the speakers in the database.

Abstract

Project Euphonia, a Google initiative, is dedicated to improving automatic speech recognition (ASR) of disordered speech. A central objective of the project is to create a large, high-quality, and diverse speech corpus. This report describes the project's latest advancements in data collection and annotation methodologies, such as expanding speaker diversity in the database, adding human-reviewed transcript corrections and audio quality tags to 350K (of the 1.2M total) audio recordings, and amassing a comprehensive set of metadata (including more than 40 speech characteristic labels) for over 75% of the speakers in the database. We report on the impact of transcript corrections on our machine-learning (ML) research, inter-rater variability of assessments of disordered speech patterns, and our rationale for gathering speech metadata. We also consider the limitations of using automated off-the-shelf annotation methods for assessing disordered speech.

Paper Structure (12 sections, 2 figures)

Introduction
Expanding Data Diversity
  Increasing the diversity of speakers and etiologies
  Documenting diversity of atypical speech patterns
  Expanding linguistic and speech pattern diversity
Improving trustworthiness of corpus
  Manual data validation and cleaning
  Testing automated approaches for identifying low-quality data
  Establishing the replicability of human expert labeling
Improving Data Collection Efficiency
Data Access
Summary and Future Directions

Figures (2)

Figure 1: Analyzed VAD negative decisions grouped by etiology. False omission rate is labeled for each etiology. (Top) utterance counts, (bottom) speaker counts.
Figure 2: Estimates of inter-rater reliability for nine of the disordered speech labels.
