WHISTRESS: Enriching Transcriptions with Sentence Stress Detection

Iddo Yosha, Dorin Shteyman, Yossi Adi

TL;DR

WhiStress addresses sentence-stress detection by extending the Whisper ASR backbone with a token-level stress head, making both training and inference alignment-free. A fully automated synthetic data pipeline yields TinyStress-15K, with $15{,}000$ training and $1{,}000$ test samples totaling approximately $15$ hours of speech, enabling scalable supervision without manual annotations. Empirical results show that WhiStress outperforms baselines on TinyStress-15K and Aix-MARSEC, and that it generalizes zero-shot to Expresso and EmphAssess despite being trained only on synthetic data. Layer analysis reveals a prosodic-linguistic trade-off, with the $9$th layer offering the best balance for stress detection, a finding that can guide the design of prosody-aware transformers. The work enables richer, stress-aware transcriptions, and the code and TinyStress-15K are publicly released for future research.

Abstract

Spoken language conveys meaning not only through words but also through intonation, emotion, and emphasis. Sentence stress, the emphasis placed on specific words within a sentence, is crucial for conveying speaker intent and has been extensively studied in linguistics. In this work, we introduce WHISTRESS, an alignment-free approach for enhancing transcription systems with sentence stress detection. To support this task, we propose TINYSTRESS-15K, a scalable, synthetic training dataset for sentence stress detection, produced by a fully automated dataset creation process. We train WHISTRESS on TINYSTRESS-15K and evaluate it against several competitive baselines. Our results show that WHISTRESS outperforms existing methods while requiring no additional input priors during training or inference. Notably, despite being trained on synthetic data, WHISTRESS demonstrates strong zero-shot generalization across diverse benchmarks. Project page: https://pages.cs.huji.ac.il/adiyoss-lab/whistress.
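
To make the idea of a stress-enriched transcription concrete, the sketch below turns per-token stress predictions into a transcript with stressed words marked. It is illustrative only and not the authors' code. The function name `mark_stressed_words`, the `*` marker, the rule that a word counts as stressed if any of its sub-word tokens is, and the convention that a new word starts at a token with a leading space are all assumptions made for this example.

```python
def mark_stressed_words(tokens, stress, marker="*"):
    """Group sub-word tokens into words and mark the stressed ones.

    Assumptions (not taken from the paper): a token that begins with a
    space starts a new word, and a word is stressed if any of its
    tokens carries a positive stress label.
    """
    words = []  # each entry is [word_text, is_stressed]
    for tok, label in zip(tokens, stress):
        if tok.startswith(" ") or not words:
            # Start a new word, dropping the leading space.
            words.append([tok.strip(), bool(label)])
        else:
            # Continuation token: extend the current word.
            words[-1][0] += tok
            words[-1][1] = words[-1][1] or bool(label)
    return " ".join(
        f"{marker}{text}{marker}" if stressed else text
        for text, stressed in words
    )


# Hypothetical tokenization and labels, chosen by hand for illustration.
tokens = [" I", " did", "n't", " take", " your", " book", "."]
stress = [0, 0, 0, 0, 1, 0, 0]
print(mark_stressed_words(tokens, stress))  # I didn't take *your* book.
```

Moving the stress mark to a different word changes the implied meaning of the same sentence, which is the information a plain transcription discards.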

Paper Structure

This paper contains 13 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: WhiStress Architecture. The Whisper model is kept frozen during training. The extension is a transformer decoder block with cross-attention for the audio embeddings, followed by an FCNN classifier that outputs the stress score per token.
  • Figure 2: Prediction of prosodic features from Whisper layer embeddings, measured by Mean Absolute Error (MAE) percentage. A lower MAE percentage indicates better prediction. Each curve shows confidence intervals.
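
The architecture in the Figure 1 caption can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions, not the released implementation. The class name `StressHead`, the layer sizes, the single-logit output passed through a sigmoid, and the use of random tensors in place of frozen Whisper outputs are all assumptions made for illustration. In a real setup the two inputs would come from a Whisper model whose parameters have `requires_grad` set to `False`.

```python
import torch
import torch.nn as nn


class StressHead(nn.Module):
    """Hypothetical stress head: one decoder block plus a small classifier."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, hidden: int = 32):
        super().__init__()
        # Transformer decoder block: self-attention over the token states,
        # then cross-attention over the audio embeddings.
        self.block = nn.TransformerDecoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=4 * d_model,
            batch_first=True,
        )
        # Fully connected classifier producing one stress logit per token.
        self.classifier = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, token_states, audio_embeddings):
        # token_states:     (batch, n_tokens, d_model)
        # audio_embeddings: (batch, n_frames, d_model)
        x = self.block(tgt=token_states, memory=audio_embeddings)
        return self.classifier(x).squeeze(-1)  # (batch, n_tokens)


torch.manual_seed(0)
head = StressHead().eval()  # eval mode disables dropout for this demo

# Random stand-ins for the frozen backbone's outputs (assumed shapes).
token_states = torch.randn(2, 7, 64)       # 2 utterances, 7 tokens each
audio_embeddings = torch.randn(2, 50, 64)  # 50 audio frames each

with torch.no_grad():
    scores = torch.sigmoid(head(token_states, audio_embeddings))

print(scores.shape)  # torch.Size([2, 7])
```

Only the head's parameters would be trained in this setup. Keeping the backbone frozen leaves its transcription behavior unchanged while the head learns to score each token.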