SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics

Takaaki Saeki; Soumi Maiti; Shinnosuke Takamichi; Shinji Watanabe; Hiroshi Saruwatari

TL;DR

This paper proposes reference-aware automatic evaluation methods for speech generation, inspired by evaluation metrics in natural language processing. Among them, SpeechBERTScore computes the BERTScore over self-supervised dense speech features of the generated and reference speech, which may differ in sequence length.

Abstract

While subjective assessments have been the gold standard for evaluating speech generation, there is a growing need for objective metrics that correlate highly with human subjective judgments, because objective metrics are more cost-efficient. This paper proposes reference-aware automatic evaluation methods for speech generation inspired by evaluation metrics in natural language processing. The proposed SpeechBERTScore computes the BERTScore for self-supervised dense speech features of the generated and reference speech, which can have different sequence lengths. We also propose SpeechBLEU and SpeechTokenDistance, which are computed on discrete speech tokens. The evaluations on synthesized speech show that our method correlates better with human subjective ratings than mel cepstral distortion and a recent mean opinion score prediction model. The proposed metrics are also effective in noisy speech evaluation and have cross-lingual applicability.
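To make the idea of scoring two feature sequences of different lengths concrete, here is a minimal sketch of BERTScore-style greedy matching over frame-level features. It assumes the self-supervised features have already been extracted as lists of vectors. The function names (`cosine`, `bertscore_on_features`) and the toy inputs are illustrative and not taken from the paper, and the sketch returns precision, recall, and F1 as in the original BERTScore definition without asserting which variant the paper reports.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bertscore_on_features(gen_feats, ref_feats):
    """BERTScore-style greedy matching over two feature sequences.

    gen_feats: frame vectors of the generated speech (length T_gen).
    ref_feats: frame vectors of the reference speech (length T_ref).
    T_gen and T_ref may differ; no alignment is required.
    """
    # Pairwise cosine similarities, arranged as a T_gen x T_ref matrix.
    sim = [[cosine(g, r) for r in ref_feats] for g in gen_feats]

    # Precision: each generated frame takes its best-matching reference frame.
    precision = sum(max(row) for row in sim) / len(gen_feats)

    # Recall: each reference frame takes its best-matching generated frame.
    recall = sum(
        max(sim[i][j] for i in range(len(gen_feats)))
        for j in range(len(ref_feats))
    ) / len(ref_feats)

    # Harmonic mean, guarded against a zero denominator.
    total = precision + recall
    f1 = 2 * precision * recall / total if total != 0 else 0.0
    return precision, recall, f1


# Toy example (hypothetical 2-dimensional "features"): 1 generated frame, 2 reference frames.
p, r, f = bertscore_on_features([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
print(p, r, f)
```

Because each frame is matched to its most similar counterpart by a maximum over the similarity matrix, the score is defined even when the generated and reference utterances have different numbers of frames.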

Paper Structure (20 sections, 8 equations, 3 figures, 10 tables)

Introduction
Method
SpeechBERTScore
SpeechBLEU
SpeechTokenDistance
Experimental evaluations
Experimental settings
Evaluation criteria
Dataset
Self-supervised pretrained models
Baselines
Main results
Ablation study of speech-token-based metrics
Layer-wise analysis
Model-wise analysis
...and 5 more sections

Figures (3)

Figure 1: Proposed speech evaluation metrics. SpeechBERTScore is computed with dense SSL speech features. Quantizer is used for SpeechBLEU and SpeechTokenDistance. $Z$ and $\hat{Z}$ are SSL features. $U$ and $\hat{U}$ are speech discrete tokens.
Figure 2: Analysis of layers in SSL models.
Figure 3: Illustration of evaluations with unaligned reference speech.
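As Figure 1 indicates, SpeechBLEU and SpeechTokenDistance operate on the discrete token sequences $U$ and $\hat{U}$ produced by a quantizer. The sketch below illustrates the two kinds of computation on toy integer token IDs. It makes two assumptions: Levenshtein edit distance stands in for the token distance, and a single clipped n-gram precision stands in for BLEU. A full BLEU score would also combine several n-gram orders and apply a brevity penalty. The function names and token values are hypothetical.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(gen_tokens, ref_tokens, n):
    """Fraction of generated n-grams also found in the reference, with clipped counts."""
    gen = ngram_counts(gen_tokens, n)
    ref = ngram_counts(ref_tokens, n)
    total = sum(gen.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in gen.items())
    return matched / total


# Hypothetical quantizer outputs for generated and reference speech.
gen_tokens = [5, 5, 7, 2]
ref_tokens = [5, 7, 7, 2]
print(levenshtein(gen_tokens, ref_tokens))
print(clipped_ngram_precision(gen_tokens, ref_tokens, 1))
print(clipped_ngram_precision(gen_tokens, ref_tokens, 2))
```

Both functions accept sequences of unequal length, so the generated and reference token streams do not need to be aligned beforehand.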
