JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis

Ryosuke Sonobe, Shinnosuke Takamichi, Hiroshi Saruwatari

TL;DR

The paper presents the JSUT corpus, a free, large-scale Japanese speech dataset of 10 hours of reading-style speech, designed to support end-to-end text-to-speech research. It achieves broad pronunciation coverage by combining nine sub-corpora that together cover daily-use kanji readings, numerals with counters, loanwords, paraphrases, voice-actress para-speech, onomatopoeia, and domain-specific content. The authors describe the recording format (48 kHz, 16-bit WAV audio with UTF-8 transcripts) and provide initial analyses of utterance lengths (in moras and words) and of mean log-scaled F0 across recording days. Because the corpus is free and can be shared between academic institutions and commercial companies, it offers a practical, multi-domain resource for building Japanese end-to-end TTS systems.
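The recording format mentioned above (48 kHz, 16-bit WAV, UTF-8 transcripts) can be checked programmatically before training. The following is a minimal sketch using only the Python standard library; the function names and the file paths passed to them are hypothetical and do not reflect the corpus's actual directory layout or any tool provided by the authors.

```python
import wave

EXPECTED_RATE_HZ = 48_000    # 48 kHz sampling rate, as stated in the summary
EXPECTED_SAMPLE_BYTES = 2    # 16-bit samples are 2 bytes wide


def wav_matches_spec(path):
    """Return True if the WAV file at `path` is 48 kHz with 16-bit samples.

    Hypothetical helper: the channel count is not checked because the
    summary does not state it.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_BYTES
        )


def is_valid_utf8(path):
    """Return True if the file at `path` decodes cleanly as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Running both checks over every audio and transcript file is a cheap way to catch resampled or re-encoded copies before they reach a TTS training pipeline.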

Abstract

With advances in machine learning techniques, including deep learning, free large-scale speech corpora that can be shared between academic institutions and commercial companies play an important role. However, no such corpus exists for Japanese speech synthesis. In this paper, we design a novel Japanese speech corpus, named the "JSUT corpus," aimed at end-to-end speech synthesis. The corpus consists of 10 hours of reading-style speech data and its transcription, and covers all of the main pronunciations of daily-use Japanese characters. We describe how we designed and analyzed the corpus. The corpus is freely available online.

Paper Structure

This paper contains 16 sections and 3 figures.

Figures (3)

  • Figure 1: Histogram of the number of moras (sub-syllables) per utterance. The minimum, mean, and maximum values are 7, 37.14, and 133, respectively.
  • Figure 2: Histogram of the number of words per utterance. The minimum, mean, and maximum values are 2, 18.03, and 70, respectively.
  • Figure 3: Mean of log-scaled F0 for each recording day. The ordinal on the x-axis indicates how many days had passed since the "1st" recording day. For example, "5th" means 4 days after the 1st recording day.
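
The statistics reported in these captions are simple to reproduce once per-utterance lengths and per-frame F0 values are available. The sketch below is illustrative only: the function names and input shapes are hypothetical, the natural logarithm is an assumption (the caption says only "log-scaled"), and treating non-positive F0 values as unvoiced frames is a common convention rather than something stated in the source.

```python
import math
from collections import defaultdict
from datetime import date


def length_summary(lengths):
    """Return (min, mean, max) of per-utterance lengths (moras or words)."""
    return min(lengths), sum(lengths) / len(lengths), max(lengths)


def day_ordinal(recording_day, first_day):
    """Map a date to the x-axis ordinal used in Figure 3.

    The first recording day is 1, so a session 4 days later is 5 ("5th").
    """
    return (recording_day - first_day).days + 1


def mean_log_f0_by_day(samples, first_day):
    """Average log F0 per recording day.

    `samples` is an iterable of (recording_day, f0_hz) pairs, one per frame.
    Frames with f0_hz <= 0 are skipped as unvoiced (an assumption).
    Returns a dict mapping day ordinal to the mean of ln(F0).
    """
    logs = defaultdict(list)
    for day, f0_hz in samples:
        if f0_hz > 0:
            logs[day_ordinal(day, first_day)].append(math.log(f0_hz))
    return {k: sum(v) / len(v) for k, v in sorted(logs.items())}
```

For example, with a hypothetical first session on `date(2017, 1, 1)`, `day_ordinal(date(2017, 1, 5), date(2017, 1, 1))` evaluates to 5, matching the caption's "5th" label for a session 4 days after the first.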