TI-ASU: Toward Robust Automatic Speech Understanding through Text-to-speech Imputation Against Missing Speech Modality

Tiantian Feng, Xuan Shi, Rahul Gupta, Shrikanth S. Narayanan

TL;DR

TI-ASU targets automatic speech understanding when the speech modality is missing, for example because of privacy or data-collection constraints, by imputing the missing audio with text-to-speech (TTS) synthesis. The framework combines pre-trained encoders (WavLM and RoBERTa), end-to-end downstream classifiers, and a speech-imputation pipeline that synthesizes speech from transcripts using multiple TTS models to increase data diversity. Empirical results show substantial gains even when up to $p=95\%$ of the training speech is missing, and TI-ASU shows added robustness under dropout-based modality perturbations. The study also evaluates LLM-assisted transcript augmentation and reports mixed results, which highlights the importance of generation quality and prompt design for further gains.

Abstract

Automatic Speech Understanding (ASU) aims at human-like speech interpretation, providing nuanced intent, emotion, sentiment, and content understanding from speech and language (text) content conveyed in speech. Typically, training a robust ASU model relies heavily on acquiring large-scale, high-quality speech and associated transcriptions. However, it is often challenging to collect or use speech data for training ASU due to concerns such as privacy. To approach this setting of enabling ASU when speech (audio) modality is missing, we propose TI-ASU, using a pre-trained text-to-speech model to impute the missing speech. We report extensive experiments evaluating TI-ASU on various missing scales, both multi- and single-modality settings, and the use of LLMs. Our findings show that TI-ASU yields substantial benefits to improve ASU in scenarios where even up to 95% of training speech is missing. Moreover, we show that TI-ASU is adaptive to dropout training, improving model robustness in addressing missing speech during inference.
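
To make the imputation idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a dataset of (speech, transcript, label) triples, simulates a speech missing ratio $p$ by removing the audio from a random subset, and fills each gap with synthetic speech produced from the transcript. The names `drop_speech`, `impute_with_tts`, `toy_tts_a`, and `toy_tts_b` are hypothetical, and the two toy functions only stand in for real pre-trained TTS models.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple

Waveform = List[float]
# (speech, or None when missing; transcript; label)
Sample = Tuple[Optional[Waveform], str, int]
TTS = Callable[[str], Waveform]


def drop_speech(samples: Sequence[Sample], p: float, seed: int = 0) -> List[Sample]:
    """Simulate a missing ratio p by removing speech from round(p * N) samples."""
    rng = random.Random(seed)
    missing = set(rng.sample(range(len(samples)), round(p * len(samples))))
    return [
        (None if i in missing else speech, text, label)
        for i, (speech, text, label) in enumerate(samples)
    ]


def impute_with_tts(samples: Sequence[Sample], tts_models: Sequence[TTS],
                    seed: int = 0) -> List[Sample]:
    """Replace each missing speech entry with speech synthesized from its transcript.

    One model is picked at random per sample, which is an assumed way of
    using several TTS models to diversify the synthetic data.
    """
    rng = random.Random(seed)
    imputed = []
    for speech, text, label in samples:
        if speech is None:
            speech = rng.choice(tts_models)(text)
        imputed.append((speech, text, label))
    return imputed


# Hypothetical stand-ins for real TTS models: one constant value per character.
def toy_tts_a(text: str) -> Waveform:
    return [0.1] * len(text)


def toy_tts_b(text: str) -> Waveform:
    return [-0.1] * len(text)


# Usage: 20 real samples, 95% of the speech removed and then imputed.
real = [([0.5, 0.5, 0.5], f"utterance {i}", i % 2) for i in range(20)]
masked = drop_speech(real, p=0.95)
train_set = impute_with_tts(masked, [toy_tts_a, toy_tts_b])
```

In this sketch the transcripts and labels are left untouched, so the imputed set can be passed to the same multimodal training loop as fully observed data.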

Paper Structure (26 sections, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Problem formulation of missing modalities in this work with ASU. The missing speech modality includes cases in training data alone or any data (both training data and testing data).
  • Figure 2: Illustration of edge devices performing ASR services, where text modality is always present.
  • Figure 3: Learning framework of TI-ASU: Imputing missing speech modality with synthetic speech content through text-to-speech transformer models for robust automatic speech understanding.
  • Figure 4: Comparison between using a single TTS model and multiple TTS models for generation in TI-ASU. Here, the training set is entirely based on synthetic speech data.
  • Figure 5: Comparison of multimodal training between TI-ASU and zero-filling imputation on missing speech. Speech missing ratio in training data $p\in\{50\%, 70\%, 90\%, 95\%\}$.