Synthetic Audio Helps for Cognitive State Tasks

Adil Soubki, John Murzaku, Peter Zeng, Owen Rambow

TL;DR

The paper addresses the limitation of text-only approaches to cognitive-state tasks in NLP by introducing Synthetic Audio Data fine-tuning (SAD), which uses zero-shot text-to-speech to generate synthetic audio for multimodal training. The authors implement a SAD pipeline that jointly tunes text and audio encoders (e.g., BERT and Whisper) and explores early and late fusion, evaluating on seven tasks across sentiment, belief, emotion, and control categories. They demonstrate that synthetic audio can provide orthogonal cues, improving performance over text-only baselines and, in some cases, approaching gold-audio performance, while also benefiting datasets that lack any human audio. The work highlights the potential of synthetic audio for cognitive-state tasks, discusses cost- and data-related considerations, and releases code to support future research in multimodal NLP with synthetic signals.

Abstract

The NLP community has broadly focused on text-only approaches to cognitive state tasks, but audio can provide vital missing cues through prosody. We posit that text-to-speech models learn to track aspects of cognitive state in order to produce naturalistic audio, and that the signal that audio models implicitly identify is orthogonal to the information that language models exploit. We present Synthetic Audio Data fine-tuning (SAD), a framework where we show that 7 tasks related to cognitive state modeling benefit from multimodal training on both text and zero-shot synthetic audio data from an off-the-shelf TTS system. We show an improvement over the text-only modality when adding synthetic audio data to text-only corpora. Furthermore, on tasks and corpora that do contain gold audio, we show our SAD framework achieves competitive performance with text and synthetic audio compared to text and gold audio.

Paper Structure

This paper contains 33 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Overview of the SAD framework, beginning with a text input. We then perform zero-shot TTS on the text to get audio and then fine-tune an audio model. In parallel, we fine-tune a text model. We then fuse the features from both modalities to get a final prediction.
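
The fusion step in Figure 1 can be illustrated with a minimal sketch. It assumes one simple form of feature fusion: the feature vectors from the two modalities are concatenated and passed through a single linear layer. The function names, feature sizes, and weights below are hypothetical and are not taken from the paper's released code. In a real pipeline the feature vectors would come from the fine-tuned text and audio models rather than being written by hand.

```python
def fuse_and_classify(text_feats, audio_feats, weights, bias):
    """Concatenate text and audio features, then apply one linear layer.

    Hypothetical helper for illustration only. `weights` holds one row per
    output class, and each row must match the length of the fused vector.
    """
    fused = list(text_feats) + list(audio_feats)
    if len(weights) != len(bias):
        raise ValueError("need exactly one bias term per output class")
    for row in weights:
        if len(row) != len(fused):
            raise ValueError("weight row length must equal fused feature length")
    # One logit per class: dot product of the class row with the fused vector.
    return [sum(w * x for w, x in zip(row, fused)) + b
            for row, b in zip(weights, bias)]


def predict(logits):
    """Return the index of the highest-scoring class."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy stand-ins for encoder outputs (assumed values, not real features).
text_features = [1.0, 2.0]   # would come from the text model
audio_features = [3.0]       # would come from the audio model on TTS output

# Toy two-class linear head over the 3-dimensional fused vector.
head_weights = [[1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0]]
head_bias = [0.0, 0.5]

logits = fuse_and_classify(text_features, audio_features, head_weights, head_bias)
label = predict(logits)
```

Concatenation keeps each modality's features intact, so the classifier head can weight them independently. This is only one possible design, and the paper's own early and late fusion variants may differ in detail.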