Conversational Speech Naturalness Predictor

Anfeng Xu, Yashesh Gaur, Naoyuki Kanda, Zhicheng Ouyang, Katerina Zmolikova, Desh Raj, Simone Merello, Anna Sun, Ozlem Kalinli

TL;DR

This paper presents a framework for an automatic naturalness predictor for two-speaker, multi-turn conversations, and proposes a dual-channel naturalness estimator, in which multiple pre-trained encoders with data augmentation are investigated.

Abstract

Evaluation of conversational naturalness is essential for developing human-like speech agents. However, existing speech naturalness predictors are often designed to assess utterances from a single speaker, failing to capture conversation-level naturalness qualities. In this paper, we present a framework for an automatic naturalness predictor for two-speaker, multi-turn conversations. We first show that existing naturalness estimators have low, or sometimes even negative, correlations with conversational naturalness, based on conversational recordings annotated with human ratings. We then propose a dual-channel naturalness estimator, in which we investigate multiple pre-trained encoders with data augmentation. Our proposed model achieves substantially higher correlation with human judgments compared to existing naturalness predictors for both in-domain and out-of-domain conditions.

Paper Structure (24 sections, 3 figures, 7 tables)

Figures (3)

  • Figure 1: Scatter plots with UTMOSv2 mean pooling methods.
  • Figure 2: Modeling architecture with dual-channel input. For the single-channel input, we only use the system-channel input.
  • Figure 3: Scatter plots of Whisper results, with dual-channel input for ConvTTS and single-channel input for FDX-Conv.