Conversational Speech Naturalness Predictor

Anfeng Xu; Yashesh Gaur; Naoyuki Kanda; Zhicheng Ouyang; Katerina Zmolikova; Desh Raj; Simone Merello; Anna Sun; Ozlem Kalinli

TL;DR

This paper presents a framework for automatically predicting the naturalness of two-speaker, multi-turn conversations. It proposes a dual-channel naturalness estimator and investigates multiple pre-trained encoders combined with data augmentation.

Abstract

Evaluation of conversational naturalness is essential for developing human-like speech agents. However, existing speech naturalness predictors are often designed to assess utterances from a single speaker, failing to capture conversation-level naturalness qualities. In this paper, we present a framework for an automatic naturalness predictor for two-speaker, multi-turn conversations. We first show that existing naturalness estimators have low, or sometimes even negative, correlations with conversational naturalness, based on conversational recordings annotated with human ratings. We then propose a dual-channel naturalness estimator, in which we investigate multiple pre-trained encoders with data augmentation. Our proposed model achieves substantially higher correlation with human judgments compared to existing naturalness predictors for both in-domain and out-of-domain conditions.
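
The comparison described in the abstract rests on correlating a predictor's scores with human naturalness ratings. The sketch below shows one way such a correlation could be computed. The abstract does not name the coefficient used, so both Pearson and Spearman are shown as assumptions, and the ratings and scores are made-up illustrative values, not data from the paper.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson linear correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-conversation values, for illustration only.
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted_scores = [1.2, 1.9, 3.4, 3.8, 4.9]

print("Pearson:", pearson(human_ratings, predicted_scores))
print("Spearman:", spearman(human_ratings, predicted_scores))
```

A predictor whose scores fall as human ratings rise would give a negative coefficient, which is the failure mode the abstract reports for some existing estimators.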

Paper Structure (24 sections, 3 figures, 7 tables)

Introduction
Related Works
Existing Speech Naturalness Predictors
Conversational Speech Generation and Evaluation
Dataset
Overview
Conversational TTS (ConvTTS) Dataset
Fully Duplex Conversational (FDX-Conv) Dataset
Existing Naturalness Predictors
Method
Modeling
Training and Evaluation Data
Data Augmentation
Model and Training Configurations
Results and Analysis
...and 9 more sections

Figures (3)

Figure 1: Scatter plots with UTMOSv2 mean pooling methods.
Figure 2: Modeling architecture with dual-channel input. For the single-channel input, we only use the system-channel input.
Figure 3: Scatter plots of Whisper results, with dual-channel input for ConvTTS and single-channel input for FDX-Conv.
