ASR Benchmarking: Need for a More Representative Conversational Dataset

Gaurav Maheshwari, Dmitry Ivanov, Théo Johannet, Kevin El Haddad

TL;DR

This work tackles the gap between traditional ASR benchmarks and real-world conversational speech by introducing a multilingual TalkBank-based conversational dataset. It presents a rigorous preprocessing pipeline to align transcripts with audio, maps speaker channels via VAD, and filters data to yield 151,705 segments across eight languages for robust benchmarking. Evaluating Whisper, wav2vec2, and Canary reveals substantial WER degradation on TalkBank compared with LibriSpeech, Fleurs, and CommonVoice, and a clear link between speech disfluencies and error rates. The findings underscore the need for more representative conversational benchmarks to drive the development of ASR systems robust to natural discourse, with future work planned to broaden demographics, release fine-tuned models, and expand conversational settings.

Abstract

Automatic Speech Recognition (ASR) systems have achieved remarkable performance on widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do not adequately reflect the complexities of real-world conversational environments, where speech is often unstructured and contains disfluencies such as pauses, interruptions, and diverse accents. In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings. Furthermore, we observe a correlation between Word Error Rate (WER) and the presence of speech disfluencies, highlighting the critical need for more realistic, conversational ASR benchmarks.

Paper Structure (15 sections, 1 equation, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Distribution of audio length in seconds per language.
  • Figure 2: WER of various ASR systems for Spanish (ES), English (EN), French (FR), and German (DE).
  • Figure 3: The Pearson correlation coefficient between WER and the number of conversation-specific elements, normalized by the length of the transcript. Here, 'non-verbal' includes expressions like 'mhh' and 'huh'; 'special characters' refer to speech intonation; 'special utterances' refer to markers like trailing off and interruptions; and 'events' refer to actions such as coughing, groaning, and sneezing. Finally, 'overall' refers to the combined effect of all categories.
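
The quantities behind Figures 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes WER is the word-level edit distance divided by the number of reference words, and that "length of the transcript" means its word count. The function names and the marker set are hypothetical, chosen only for illustration.

```python
from math import sqrt


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def normalized_count(transcript: str, markers: set[str]) -> float:
    """Number of marker tokens divided by the transcript's word count.

    Assumption: markers appear as whitespace-separated tokens.
    """
    words = transcript.split()
    if not words:
        raise ValueError("transcript must not be empty")
    return sum(word in markers for word in words) / len(words)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Hypothetical marker set standing in for the 'non-verbal' category.
NON_VERBAL = {"mhh", "huh"}
```

To mirror the figure, one would compute `normalized_count` and `word_error_rate` for each segment, then pass the two resulting lists to `pearson`, once per category.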