Non-verbal information in spontaneous speech -- towards a new framework of analysis

Tirza Biron, Moshe Barboy, Eran Ben-Artzy, Alona Golubchik, Yanir Marmor, Smadar Szekely, Yaron Winter, David Harel

TL;DR

This paper offers an analytical schema and a technological proof-of-concept for categorizing prosodic signals and associating them with meaning. The schema interprets surface representations of multi-layered prosodic events.

Abstract

Non-verbal signals in speech are encoded by prosody and carry information that ranges from conversation action to attitude and emotion. Despite the importance of prosody, the principles that govern prosodic structure are not yet adequately understood. This paper offers an analytical schema and a technological proof-of-concept for the categorization of prosodic signals and their association with meaning. The schema interprets surface representations of multi-layered prosodic events. As a first step towards implementation, we present a classification process that disentangles prosodic phenomena of three orders. The process relies on fine-tuning a pre-trained speech recognition model, which enables simultaneous multi-class/multi-label detection. It generalizes over a large variety of spontaneous data, performing on a par with, or better than, human annotation. Beyond providing a standardized formalization of prosody, disentangling prosodic patterns can inform a theory of communication and speech organization. A welcome by-product is an interpretation of prosody that can enhance speech- and language-related technologies.
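
One way to read "simultaneous multi-class/multi-label detection" is that each word receives exactly one label from a mutually exclusive set (multi-class) plus any number of independent flags (multi-label). The sketch below illustrates that reading only; it is not the paper's implementation. The two-head layout, the 0.5 threshold, and the names `PROTOTYPES` and `decode_word` are hypothetical; the label names "continuation", "conclusion" and "emphasis" are taken from the figure captions.

```python
import math

# Hypothetical label set for the mutually exclusive (multi-class) head.
PROTOTYPES = ["continuation", "conclusion"]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Map a raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_word(prototype_logits, flag_logits, flag_names, threshold=0.5):
    """Decode one word: one prototype (multi-class) plus zero or more flags (multi-label).

    All arguments are assumptions made for illustration, not the paper's interface.
    """
    probs = softmax(prototype_logits)
    prototype = PROTOTYPES[probs.index(max(probs))]
    # Each flag is thresholded on its own, so several (or none) may fire.
    flags = [name for name, z in zip(flag_names, flag_logits)
             if sigmoid(z) >= threshold]
    return prototype, flags


print(decode_word([2.0, 0.5], [1.2], ["emphasis"]))
```

The multi-class head uses a softmax because the prototypes compete with one another, whereas the flags use independent sigmoids because, per Figure 1, emphasis, emotion and attitude are orthogonal to the prototype hierarchy.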

Paper Structure (22 sections, 2 equations, 5 figures, 5 tables, 1 algorithm)

Figures (5)

  • Figure 1: An illustration of the analytical hierarchy for IUs. Note that emotion, emphasis and attitude are orthogonal to the prosodic prototype and discourse function hierarchy.
  • Figure 2: Log pitch course, median-normalized and time-normalized, of manually (a, c, e) and automatically (b, d, f) annotated IUs, “This American Life” corpus [Glass1995ThisAmericanLife]. 2(a-b) Log pitch course of the prototypes “continuation” (“comma”; 2a n=3,184; 2b n=34,455) and “conclusion” (“period”; 2a n=2,323; 2b n=27,415) for manual and automatic annotation, respectively. 2(c-d) Log pitch course of “continuation” IUs that bear emphasis in their first half (blue), second half (orange), and all “continuation” IUs (green), for manual and automatic annotation, respectively. 2(e-f) Log pitch course of “continuation” for IUs that bear emphasis in their first half (blue) or their second half (orange), for manual and automatic annotation, respectively. Note the influence of the underlying “comma” pitch pattern on the production of emphasis, and the resemblance between manually annotated and automatically obtained IUs.
  • Figure 3: An example of the categories and labels that constitute the prosodic messages in https://a7ce520753ab849816faaa3f4fc591b1.cdn.bubble.io/f1701163514358x930943014691363400/01You_have_no_guide_when_you're_a_young_person_(rethorical_questions_list).wav. Excerpt drawn from [du2000santa].
  • Figure 4: Training scheme. The backbone of our method is a fine-tuned WHISPER model [radford2023robust]. Its input comprises speech audio, the corresponding text, and prosodic labels; its output is a predicted label combination for each word of the input text.
  • Figure 5: Impact of model size on the performance of re-trained WHISPER for three simultaneous tasks. (5b) Tests on the TAL dataset, with and without considering the first word of a turn.
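
The "median-normalized and time-normalized" log pitch courses in Figure 2 can be approximated as follows. This is a minimal sketch under stated assumptions, not the authors' procedure: the function name `normalize_pitch_contour`, the default of 50 output points, the convention that unvoiced frames are coded as 0 or None, and the choice to drop those frames before resampling are all hypothetical.

```python
import math
from statistics import median


def normalize_pitch_contour(f0_hz, n_points=50):
    """Return a log-pitch contour centered on its median and resampled to n_points.

    f0_hz: per-frame pitch values in Hz for one intonation unit.
    Assumption: unvoiced frames are coded as 0 or None and are simply dropped,
    which compresses the time axis across unvoiced stretches.
    """
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    voiced = [f for f in f0_hz if f is not None and f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")

    # Median normalization: subtracting the median in the log domain removes
    # each speaker's overall pitch level, leaving only the contour shape.
    log_f0 = [math.log(f) for f in voiced]
    center = median(log_f0)
    centered = [v - center for v in log_f0]

    # Time normalization: linear interpolation onto n_points equally spaced
    # positions, so units of different durations can be averaged point by point.
    last = len(centered) - 1
    resampled = []
    for i in range(n_points):
        pos = i * last / (n_points - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, last)
        frac = pos - lo
        resampled.append(centered[lo] * (1 - frac) + centered[hi] * frac)
    return resampled


contour = normalize_pitch_contour([100, 0, 200, 400], n_points=5)
print([round(v, 3) for v in contour])
```

Once every unit has the same number of points and a zero median, averaging the contours position by position across all units of one category gives a mean curve of the kind plotted in Figure 2.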