JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions

Detai Xin, Junfeng Jiang, Shinnosuke Takamichi, Yuki Saito, Akiko Aizawa, Hiroshi Saruwatari

TL;DR

JVNV tackles the lack of Japanese emotional speech corpora featuring nonverbal vocalizations by automatically generating emotion-specific scripts with NV phrases via prompt-engineered LLMs. A two-stage selection yields 514 scripts (356 core + 158 extra) used to record about 3.94 hours of speech from four native speakers across six emotions, with NV durations annotated. Technical validation shows improved phoneme coverage and emotion recognizability relative to prior corpora, and a TTS benchmark using discrete NV codes highlights the ongoing difficulty of synthesizing emotional speech with NVs. Overall, JVNV demonstrates a scalable approach to incorporating NVs in expressive Japanese speech and provides a valuable resource for SER and TTS research.

Abstract

We present JVNV, a Japanese emotional speech corpus with verbal content and nonverbal vocalizations whose scripts are generated by a large-scale language model. Existing emotional speech corpora lack not only proper emotional scripts but also nonverbal vocalizations (NVs), which are essential for expressing emotions in spoken language. We propose an automatic script generation method that produces emotional scripts by providing seed words with sentiment polarity and phrases of nonverbal vocalizations to ChatGPT using prompt engineering. We select 514 scripts with balanced phoneme coverage from the generated candidate scripts with the assistance of emotion confidence scores and language fluency scores. We demonstrate the effectiveness of JVNV by showing that it has better phoneme coverage and emotion recognizability than previous Japanese emotional speech corpora. We then benchmark JVNV on emotional text-to-speech synthesis using discrete codes to represent NVs. We show that a gap remains between the performance of synthesizing read-aloud speech and emotional speech, and that adding NVs to the speech makes the task even harder. This poses new challenges for the task and makes JVNV a valuable resource for future work. To the best of our knowledge, JVNV is the first speech corpus whose scripts are generated automatically using large language models.
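
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the thresholds, the greedy coverage rule, the `Candidate` fields, and the assumption that both scores are normalized so that higher is better are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A generated script with precomputed scores (all fields are hypothetical)."""
    text: str
    phonemes: list[str]   # phoneme sequence of the script
    emotion_conf: float   # confidence that the script conveys the target emotion
    fluency: float        # language fluency score, assumed "higher is better"


def select_scripts(candidates, k, min_emotion=0.5, min_fluency=0.5):
    """Filter by score thresholds, then greedily pick up to k scripts
    that each add the most not-yet-covered phonemes."""
    # Stage 1 (assumed): drop scripts with weak emotion or poor fluency.
    pool = [
        c for c in candidates
        if c.emotion_conf >= min_emotion and c.fluency >= min_fluency
    ]
    covered = set()
    chosen = []
    # Stage 2 (assumed): greedy coverage, ties broken by emotion confidence.
    while pool and len(chosen) < k:
        best = max(
            pool,
            key=lambda c: (len(set(c.phonemes) - covered), c.emotion_conf),
        )
        chosen.append(best)
        covered |= set(best.phonemes)
        pool.remove(best)
    return chosen


demo = [
    Candidate("A", ["a", "k", "a"], 0.90, 0.9),
    Candidate("B", ["a", "k"], 0.95, 0.9),
    Candidate("C", ["s", "u", "i"], 0.80, 0.8),
    Candidate("D", ["m", "o"], 0.20, 0.9),  # removed by the emotion threshold
]
picked = [c.text for c in select_scripts(demo, k=2)]
```

A greedy rule is used here only because it is easy to follow; any method that balances phoneme coverage against the two scores would fit the description in the abstract equally well.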

Paper Structure

This paper contains 22 sections, 1 equation, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Overview of the proposed emotional script generation method with NV phrases. The example uses "happiness" as the emotion, "interesting" as the seed word, and "haha" as the NV phrase. The prompt uses the word "interjection" in place of "NV" so that ChatGPT can understand the concept.
  • Figure 2: Prompt template for script generation, using happiness as the example emotion. Text enclosed in [] is replaced with the appropriate content during script generation, and text starting with # is a comment. Script generation uses $n=3$ demonstrations, one of which is shown as an example together with its English translation.
  • Figure 3: The number of unique sequences of one or more consecutive phonemes in each corpus.
  • Figure 4: The proposed TTS method uses codes to represent NVs. Codes and phonemes are concatenated together and fed to the TTS model.
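
Two of the captions above describe operations that are easy to sketch in code: counting unique sequences of consecutive phonemes (Figure 3) and concatenating NV codes with phonemes into one input sequence (Figure 4). The sketch below is illustrative only. The function names, the `<nv_K>` token format, and the choice to count n-grams within utterance boundaries are assumptions, not details taken from the paper.

```python
def count_unique_ngrams(utterances, n):
    """Count distinct runs of n consecutive phonemes across a corpus.

    `utterances` is a list of phoneme lists; n-grams are not allowed to
    span two utterances (an assumption of this sketch).
    """
    seen = set()
    for phonemes in utterances:
        for i in range(len(phonemes) - n + 1):
            seen.add(tuple(phonemes[i:i + n]))
    return len(seen)


def build_tts_input(segments):
    """Flatten ("nv", codes) and ("phonemes", symbols) segments into one
    token sequence, rendering each discrete NV code as a "<nv_K>" token
    (hypothetical token format)."""
    tokens = []
    for kind, items in segments:
        if kind == "nv":
            tokens.extend(f"<nv_{code}>" for code in items)
        else:
            tokens.extend(items)
    return tokens


corpus = [["a", "h", "a", "h", "a"], ["h", "a"]]
unigram_count = count_unique_ngrams(corpus, 1)
bigram_count = count_unique_ngrams(corpus, 2)

tts_input = build_tts_input([
    ("nv", [17, 4]),                 # discrete codes standing in for an NV
    ("phonemes", ["o", "m", "o"]),   # phonemes of the verbal content
])
```

In a real system the resulting token sequence would be mapped to embedding indices before being fed to the TTS model; that step is omitted here.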