The Role of Prosody in Spoken Question Answering

Jie Chi, Maureen de Seyssel, Natalie Schluter

TL;DR

The paper investigates the role of prosody in Spoken Question Answering by systematically decoupling prosodic and lexical cues in natural speech data. It uses the SLUE-SQA-5 dataset and a non-cascade DUAL model to evaluate how prosody and lexical information contribute to locating answer spans. The findings show that prosody provides meaningful, complementary cues, but lexical content dominates when both are present, suggesting current models underutilize prosody. The work highlights the need for better integration of prosodic information to robustly perform SQA, especially under degraded lexical conditions.

Abstract

Spoken language understanding research to date has generally carried a heavy text perspective. Most datasets are derived from text that is subsequently synthesized into speech, and most models rely on automatic transcriptions of speech. This is to the detriment of prosody, the additional information carried by the speech signal beyond the phonetics of the words themselves, which is difficult to recover from text alone. In this work, we investigate the role of prosody in Spoken Question Answering. By isolating prosodic and lexical information on the SLUE-SQA-5 dataset, which consists of natural speech, we demonstrate that models trained on prosodic information alone can perform reasonably well by utilizing prosodic cues. However, we find that when lexical information is available, models tend to rely predominantly on it. Our findings suggest that while prosodic cues provide valuable supplementary information, more effective integration methods are required to ensure prosody contributes more significantly alongside lexical features.

Paper Structure

This paper contains 13 sections, 2 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Illustration of the SQA format
  • Figure 2: Spectrogram of the example speech under different conditions. In each sub-figure, the top plot is the waveform, the second plot is the spectrogram, the third plot is the intensity, and the bottom plot is the F0.
  • Figure 3: Illustration of ground truth span, predicted span and overlapping span for evaluation.
  • Figure 4: Performance on the test set with different cut-off frequencies
  • Figure 5: Evaluation loss across different conditions. From left to right, the model is trained on (1) the full prosodic training set, (2) a combination of the full prosodic training set and 5% from both the lexical and natural training sets, and (3) the full training sets for all conditions. The reference line indicates the lowest prosodic loss when the model is trained on all conditions.
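
The caption of Figure 3 refers to a ground truth span, a predicted span, and their overlapping span. The listing here does not give the exact metric, so the following is only a minimal sketch of one common way to score such overlaps: precision, recall, and F1 computed from the time overlap of two spans. The `Span` type and the function names `overlap_duration` and `span_overlap_scores` are hypothetical and are not taken from the paper.

```python
from typing import Tuple

# Hypothetical representation: a span is (start_sec, end_sec) in the spoken document.
Span = Tuple[float, float]


def overlap_duration(pred: Span, gold: Span) -> float:
    """Length in seconds of the intersection of two spans (0.0 if they are disjoint)."""
    start = max(pred[0], gold[0])
    end = min(pred[1], gold[1])
    return max(0.0, end - start)


def span_overlap_scores(pred: Span, gold: Span) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) based on time overlap.

    Assumption: precision is overlap / predicted length and recall is
    overlap / ground-truth length; the paper's metric may differ in detail.
    """
    overlap = overlap_duration(pred, gold)
    pred_len = pred[1] - pred[0]
    gold_len = gold[1] - gold[0]
    # Disjoint or degenerate (zero or negative length) spans score zero.
    if overlap == 0.0 or pred_len <= 0 or gold_len <= 0:
        return 0.0, 0.0, 0.0
    precision = overlap / pred_len
    recall = overlap / gold_len
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Predicted answer at 1.0-3.0 s, ground truth at 2.0-4.0 s: 1.0 s of overlap.
    print(span_overlap_scores((1.0, 3.0), (2.0, 4.0)))
```

Scoring in the time domain rather than on transcribed words fits a non-cascade setup, where the model points at a stretch of audio and no transcript is assumed to be available.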