NV-Bench: Benchmark of Nonverbal Vocalization Synthesis for Expressive Text-to-Speech Generation

Qinke Ni; Huan Liao; Dekun Chen; Yuxiang Wang; Zhizheng Wu

Abstract

While recent text-to-speech (TTS) systems increasingly integrate nonverbal vocalizations (NVs), their evaluation lacks standardized metrics and reliable ground-truth references. To bridge this gap, we propose NV-Bench, the first benchmark grounded in a functional taxonomy that treats NVs as communicative acts rather than acoustic artifacts. NV-Bench comprises 1,651 multi-lingual, in-the-wild utterances with paired human reference audio, balanced across 14 NV categories. We introduce a dual-dimensional evaluation protocol: (1) Instruction Alignment, which uses the proposed paralinguistic character error rate (PCER) to assess controllability, and (2) Acoustic Fidelity, which measures the distributional gap to real recordings to assess acoustic realism. We evaluate diverse TTS models and develop two baselines. Experimental results demonstrate a strong correlation between our objective metrics and human perception, establishing NV-Bench as a standardized evaluation framework.

Paper Structure (21 sections, 2 equations, 1 figure, 5 tables)

Introduction
Methods
Multi-lingual NVASR
Model architecture
Data construction and label normalization
Benchmark dataset construction
Data collection
Filtering pipeline
NV-Bench
Experiments
Experiments for multi-lingual NVASR
Experimental setup and baselines
Evaluation metrics
Results
Benchmarking NV-Capable TTS models
...and 6 more sections

Figures (1)

Figure 1: Overview of NV-Bench. (1) Data Processing: Raw audio is filtered using the Emilia-Pipeline and MiMo-Audio. (2) Multi-lingual NVASR: We train a multi-lingual NVASR model on open-source data with a unified label taxonomy. (3) Evaluation: After human verification, evaluation on the benchmark covers the instruction alignment and acoustic fidelity dimensions, together with human subjective ratings.
