StressTest: Can YOUR Speech LM Handle the Stress?

Iddo Yosha, Gallil Maimon, Yossi Adi

TL;DR

StressTest addresses a gap in speech-aware language modeling by formalizing sentence stress understanding as two tasks: sentence stress detection (SSD) and sentence stress reasoning (SSR). The authors introduce StressTest, a benchmark, and Stress-17k, a training set produced by a synthetic data generation pipeline. Fine-tuning on Stress-17k yields StresSLM, a model that infers speaker intent from prosodic stress and shows substantial gains over existing models on SSR and SSD while preserving core automatic speech recognition (ASR) and speech emotion recognition (SER) capabilities. The approach combines WhiStress-based verification, TTS-driven synthesis of stressed speech, and a multi-task fine-tuning regime, demonstrating that end-to-end or near-end-to-end stress reasoning can outperform cascade methods that rely on transcription. The work highlights the practical importance of prosodic cues for meaning and provides a scalable path to robust stress-aware speech-language understanding, with implications for accessibility and nuanced human–AI interaction.

Abstract

Sentence stress refers to emphasis on words within a spoken utterance to highlight or contrast an idea. It is often used to imply an underlying intention not explicitly stated. Recent speech-aware language models (SLMs) have enabled direct audio processing, allowing models to access the full richness of speech to perform audio reasoning tasks such as spoken question answering. Despite the crucial role of sentence stress in shaping meaning and intent, it remains largely overlooked in the evaluation and development of SLMs. We address this gap by introducing StressTest, a benchmark designed to evaluate models' ability to distinguish between meanings of speech based on the stress pattern. We evaluate leading SLMs and find that, despite their overall capabilities, they perform poorly on such tasks. Hence, we propose a novel data generation pipeline and create Stress-17k, a training set that simulates the change of meaning implied by stress variation. Results suggest that our fine-tuned model, StresSLM, generalizes well to real recordings and notably outperforms existing SLMs on sentence stress reasoning and detection. Models, code, data, and samples are available at pages.cs.huji.ac.il/adiyoss-lab/stresstest.
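
To make the two tasks concrete, the following is a minimal sketch of how a stress-dependent item could be represented and scored. It is an illustration only: the `StressItem` fields, the example sentence, and the choice of F1 for detection and accuracy for reasoning are assumptions made here, not the benchmark's actual schema or metrics.

```python
from dataclasses import dataclass


@dataclass
class StressItem:
    """Hypothetical item layout; field names are illustrative assumptions."""
    words: list[str]            # transcript tokens
    stressed: set[int]          # gold indices of the stressed words
    interpretations: list[str]  # candidate meanings of the utterance
    answer: int                 # index of the meaning implied by the stress


def ssd_f1(gold: set[int], pred: set[int]) -> float:
    """F1 between gold and predicted stressed-word indices (assumed detection metric)."""
    if not gold and not pred:
        return 1.0
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def ssr_accuracy(items: list[StressItem], choices: list[int]) -> float:
    """Fraction of items where the chosen interpretation matches the gold one."""
    if not items or len(items) != len(choices):
        raise ValueError("items and choices must be non-empty and the same length")
    correct = sum(1 for item, choice in zip(items, choices) if item.answer == choice)
    return correct / len(items)


# Same transcript, two stress patterns, two different implied meanings.
words = ["I", "didn't", "say", "she", "stole", "the", "money"]
meanings = ["Someone else stole the money.", "She stole something other than money."]
items = [
    StressItem(words, {3}, meanings, answer=0),  # stress on "she"
    StressItem(words, {6}, meanings, answer=1),  # stress on "money"
]
```

The point of the pairing is that a text-only system sees the two items as identical, so it can answer at most one of them correctly unless it has access to the stress pattern.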

Paper Structure

This paper contains 67 sections, 1 equation, 4 figures, and 12 tables.

Figures (4)

  • Figure 1: StressTest provides samples that can be understood differently based on stress. We consider sentence stress detection (SSD) and sentence stress reasoning (SSR). StresSLM detects stress and reasons about the meaning.
  • Figure 2: An illustrative example of the synthetic training data generation process.
  • Figure 3: Categorization of sentence stress types in StressTest.
  • Figure 4: Human evaluation annotation view.