SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics

Takaaki Saeki, Soumi Maiti, Shinnosuke Takamichi, Shinji Watanabe, Hiroshi Saruwatari

TL;DR

This paper proposes reference-aware automatic evaluation methods for speech generation, inspired by evaluation metrics in natural language processing. Among them, SpeechBERTScore computes the BERTScore on self-supervised dense speech features of the generated and reference speech, which may differ in sequence length.

Abstract

While subjective assessments have been the gold standard for evaluating speech generation, there is a growing need for cost-efficient objective metrics that correlate highly with human subjective judgments. This paper proposes reference-aware automatic evaluation methods for speech generation inspired by evaluation metrics in natural language processing. The proposed SpeechBERTScore computes the BERTScore for self-supervised dense speech features of the generated and reference speech, which can have different sequential lengths. We also propose SpeechBLEU and SpeechTokenDistance, which are computed on speech discrete tokens. The evaluations on synthesized speech show that our method correlates better with human subjective ratings than mel cepstral distortion and a recent mean opinion score prediction model. The proposed metrics are also effective in noisy speech evaluation and have cross-lingual applicability.
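To make the idea of scoring two feature sequences of different lengths concrete, here is a minimal sketch of the standard BERTScore computation (greedy cosine-similarity matching) applied to generic frame-level vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names `cosine` and `bertscore_style` are hypothetical, the inputs are plain Python lists standing in for SSL features, and the abstract does not say which of precision, recall, or F1 is reported as SpeechBERTScore.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-dimension vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def bertscore_style(
    generated: Sequence[Vector], reference: Sequence[Vector]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using BERTScore-style greedy matching.

    `generated` and `reference` are sequences of frame-level feature vectors
    (assumed here to stand in for SSL features); their lengths may differ.
    """
    if not generated or not reference:
        raise ValueError("both feature sequences must be non-empty")

    # sim[i][j] = cosine similarity between generated frame i and reference frame j
    sim = [[cosine(g, r) for r in reference] for g in generated]

    # Precision: each generated frame is matched to its most similar reference frame.
    precision = sum(max(row) for row in sim) / len(generated)

    # Recall: each reference frame is matched to its most similar generated frame.
    recall = sum(
        max(sim[i][j] for i in range(len(generated)))
        for j in range(len(reference))
    ) / len(reference)

    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1
```

Because each frame is matched to its best counterpart in the other sequence, no explicit time alignment is needed, which is why the two sequences may have different lengths.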

Paper Structure

This paper contains 20 sections, 8 equations, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Proposed speech evaluation metrics. SpeechBERTScore is computed with dense SSL speech features. Quantizer is used for SpeechBLEU and SpeechTokenDistance. $Z$ and $\hat{Z}$ are SSL features. $U$ and $\hat{U}$ are speech discrete tokens.
  • Figure 2: Analysis of layers in SSL models.
  • Figure 3: Illustration of evaluations with unaligned reference speech.
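
The caption of Figure 1 says that SpeechBLEU and SpeechTokenDistance operate on speech discrete tokens $U$ and $\hat{U}$ produced by a quantizer. The sketch below shows how such token-level metrics can be computed once the two token sequences are available. Several choices here are assumptions rather than details from the source: the token IDs are made up, the BLEU variant is a plain sentence-level BLEU with clipped n-gram precision, a brevity penalty, and no smoothing, and Levenshtein edit distance is used as one plausible token distance. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def ngram_counts(tokens: Sequence[int], n: int) -> Counter:
    """Count the n-grams (as tuples) occurring in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: Sequence[int], reference: Sequence[int], n: int) -> float:
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def token_bleu(candidate: Sequence[int], reference: Sequence[int], max_n: int = 2) -> float:
    """Unsmoothed sentence-level BLEU over discrete tokens (assumed variant)."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1.0 - len(reference) / len(candidate))
    return brevity * math.exp(log_mean)


def levenshtein(a: Sequence[int], b: Sequence[int]) -> int:
    """Edit distance (insertions, deletions, substitutions) between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


# Made-up token IDs standing in for quantizer output.
reference_tokens = [5, 5, 12, 7, 7, 3]
generated_tokens = [5, 5, 12, 9, 7, 3]
print(token_bleu(generated_tokens, reference_tokens))
print(levenshtein(generated_tokens, reference_tokens))
```

Both functions accept sequences of different lengths, so the generated and reference token streams do not need to be aligned beforehand.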