Fine-Tuning Whisper for Inclusive Prosodic Stress Analysis

Samuel S. Sohn, Sten Knutsen, Karin Stromswold

TL;DR

This paper addresses the challenge of prosodic stress recognition in ASR by fine-tuning Whisper large-v2 to detect phrasal, lexical, and contrastive stress using a dataset of 66 speakers that includes neurotypical and neurodivergent participants. It investigates cross-stress acoustic transfer and demonstrates that an all-stress fine-tuned model attains near-human transcription accuracy, while also exploring gender and neurotype classification with high precision and practical fallback mechanisms. The work reveals shared acoustic cues between certain stress types, highlights weaknesses in phrasal stress transfer, and shows the feasibility of stress-aware transcription to promote equitable speech technologies. Overall, the study contributes to more inclusive transcription systems and informs theory on prosody integration with syntax and semantics.

Abstract

Prosody plays a crucial role in speech perception, influencing both human understanding and automatic speech recognition (ASR) systems. Despite its importance, prosodic stress remains under-studied because it is difficult to analyze efficiently. This study explores fine-tuning OpenAI's Whisper large-v2 ASR model to recognize phrasal, lexical, and contrastive stress in speech. Using a dataset of 66 native English speakers, including male, female, neurotypical, and neurodivergent individuals, we assess the model's ability to generalize stress patterns and to classify speakers by neurotype and gender based on brief speech samples. Our results show near-human accuracy in ASR performance across all three stress types and near-perfect precision in classifying gender and neurotype. By improving prosody-aware ASR, this work contributes to equitable and robust transcription technologies for diverse populations.

Paper Structure

This paper contains 5 sections and 3 tables.