Spontaneous Informal Speech Dataset for Punctuation Restoration

Xing Yi Liu, Homayoon Beigi

TL;DR

This paper introduces SponSpeech, a punctuation restoration dataset derived from informal speech sources that includes punctuation and casing information, and contributes a filtering pipeline that examines the quality of both speech audio and transcription text.

Abstract

Presently, punctuation restoration models are evaluated almost solely on well-structured, scripted corpora. Real-world ASR systems and post-processing pipelines, on the other hand, are typically applied to spontaneous speech with significant irregularities, stutters, and deviations from perfect grammar. To address this discrepancy, we introduce SponSpeech, a punctuation restoration dataset derived from informal speech sources, which includes punctuation and casing information. In addition to publicly releasing the dataset, we contribute a filtering pipeline that can be used to generate more data. Our filtering pipeline examines the quality of both speech audio and transcription text. We also carefully construct a "challenging" test set, aimed at evaluating models' ability to leverage audio information to predict otherwise grammatically ambiguous punctuation. SponSpeech is available at https://github.com/GitHubAccountAnonymous/PR, along with all code for dataset building and model runs.
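To make the phrase "punctuation and casing information" concrete, here is a minimal Python sketch of how a punctuated transcript can be turned into per-token labels for a punctuation restoration model. The label names (`COMMA`, `CAPITALIZED`, and so on) and the function `to_labels` are assumptions chosen for illustration; they are not taken from the SponSpeech release, whose actual format may differ.

```python
# Hypothetical label set; the real dataset may use different marks or names.
PUNCT = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}


def to_labels(transcript):
    """Split a punctuated transcript into (token, punctuation, casing) triples.

    The token is lowercased and stripped of one trailing punctuation mark,
    which is what a model would see as input; the two labels are what it
    would be asked to predict.
    """
    examples = []
    for word in transcript.split():
        # Peel one trailing punctuation mark off the word, if present.
        punct = "NONE"
        if word[-1] in PUNCT:
            punct = PUNCT[word[-1]]
            word = word[:-1]
        if not word:
            continue  # the token consisted of punctuation only

        # Casing label: all caps (acronyms), initial capital, or lowercase.
        if len(word) > 1 and word.isupper():
            case = "UPPER"
        elif word[0].isupper():
            case = "CAPITALIZED"
        else:
            case = "LOWER"

        examples.append((word.lower(), punct, case))
    return examples


if __name__ == "__main__":
    # A spontaneous-style utterance with a repeated word.
    for triple in to_labels("Well, I I think so. Do you?"):
        print(triple)
```

Keeping punctuation and casing as separate labels lets the same token sequence serve both tasks; a spontaneous utterance such as the one above keeps its repetition ("I I") in the input rather than being cleaned up.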

Paper Structure (10 sections, 1 figure, 7 tables)


Figures (1)

  • Figure 1: Filtering pipeline used to create SponSpeech. Blue indicates a text-based filter and yellow indicates an audio-based filter, as the bottom-right icons also show. Sub-boxes within the subtitle quality and appropriateness filters are the evaluation criteria used.
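
The general shape of such a pipeline, a chain of filters each tagged as text-based or audio-based, can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Clip` fields, the two example criteria (`has_punctuation`, `min_snr`), and the 10 dB threshold are hypothetical placeholders, not the criteria used to build SponSpeech.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Clip:
    transcript: str
    snr_db: float  # hypothetical precomputed audio quality estimate


@dataclass
class Filter:
    name: str
    modality: str  # "text" or "audio", mirroring the two colours in Figure 1
    keep: Callable[[Clip], bool]


def run_pipeline(clips: List[Clip], filters: List[Filter]) -> Tuple[List[Clip], Dict[str, int]]:
    """Apply filters in order; return surviving clips and per-filter drop counts."""
    kept = list(clips)
    dropped: Dict[str, int] = {}
    for f in filters:
        survivors = [c for c in kept if f.keep(c)]
        dropped[f.name] = len(kept) - len(survivors)
        kept = survivors
    return kept, dropped


# Placeholder criteria, chosen only to show one filter of each modality.
FILTERS = [
    Filter("has_punctuation", "text",
           lambda c: any(ch in ".,?!" for ch in c.transcript)),
    Filter("min_snr", "audio", lambda c: c.snr_db >= 10.0),
]

if __name__ == "__main__":
    clips = [
        Clip("Yeah, I guess so.", 20.0),
        Clip("yeah i guess so", 20.0),   # no punctuation: fails the text filter
        Clip("Right. Okay.", 3.0),       # noisy audio: fails the audio filter
    ]
    kept, dropped = run_pipeline(clips, FILTERS)
    print(len(kept), dropped)
```

Recording how many clips each stage drops makes it easier to see which criterion is the most restrictive when the pipeline is rerun to generate more data.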