Uni-VERSA: Versatile Speech Assessment with a Unified Network

Jiatong Shi, Hye-Jin Shim, Shinji Watanabe

TL;DR

Speech quality assessment has relied on subjective MOS tests, which are costly and scale poorly. Uni-VERSA offers a unified network that predicts multiple objective metrics across five domains—noise level, prosody, naturalness, intelligibility, and speaker characteristics—within a single framework. The work formalizes the framework and evaluation protocol, introduces a URGENT24-based benchmark, and demonstrates that Uni-VERSA correlates well with human judgments while delivering large efficiency gains over single-metric evaluation. Through extensive experiments and out-of-domain tests, the approach shows promise for scalable, comprehensive speech quality assessment in enhancement, synthesis, and conversational systems, with potential applicability to broader audio domains.

Abstract

Subjective listening tests remain the gold standard for speech quality assessment, but they are costly, variable, and difficult to scale. In contrast, existing objective metrics, such as PESQ, F0 correlation, and DNSMOS, typically capture only specific aspects of speech quality. To address these limitations, we introduce Uni-VERSA, a unified network that simultaneously predicts various objective metrics, encompassing naturalness, intelligibility, speaker characteristics, prosody, and noise, for a comprehensive evaluation of speech signals. We formalize its framework, evaluation protocol, and applications in speech enhancement, synthesis, and quality control. A benchmark based on the URGENT24 challenge, along with a baseline leveraging self-supervised representations, demonstrates that Uni-VERSA provides a viable alternative to single-aspect evaluation methods. Moreover, it aligns closely with human perception, making it a promising approach for future speech quality assessment.

Paper Structure

This paper contains 12 sections, 2 equations, 1 figure, 5 tables.

Figures (1)

  • Figure 1: The architecture of the Uni-VERSA base model with two metrics as an example. Detailed implementation is discussed in the base model subsection.