JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis

Ryosuke Sonobe; Shinnosuke Takamichi; Hiroshi Saruwatari

TL;DR

The paper presents the JSUT corpus, a free 10-hour Japanese speech dataset designed to support end-to-end text-to-speech research. It achieves broad pronunciation coverage by assembling nine specialized sub-corpora that collectively address daily-use kanji readings, numerals with counters, loanwords, paraphrases, voice-actress para-speech, onomatopoeia, and domain-specific content. The authors describe the recording format (48 kHz, 16-bit WAV audio with UTF-8 transcripts) and provide initial analyses of utterance lengths and pitch statistics, validating diversity across days. Because it can be shared freely between academia and industry, the JSUT corpus offers a practical, multi-domain resource for building Japanese end-to-end TTS systems.
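As a rough illustration of the recording format mentioned above (48 kHz, 16-bit WAV, UTF-8 transcripts), the following minimal sketch checks a WAV file against those parameters and loads a transcript using only the Python standard library. The function names and the `utterance_id:text` transcript layout are assumptions made for this example, not details confirmed by the summary or abstract.

```python
import wave


def check_wav_format(path, expected_rate=48000, expected_sample_width=2):
    """Return True if the WAV file has the expected sampling rate and
    sample width (2 bytes = 16 bits)."""
    with wave.open(str(path), "rb") as wav:
        return (
            wav.getframerate() == expected_rate
            and wav.getsampwidth() == expected_sample_width
        )


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def read_transcript(path):
    """Read a UTF-8 transcript into a dict of {utterance_id: text}.

    Assumption: one utterance per line in the form 'utterance_id:text'.
    This layout is hypothetical; adjust it to the actual transcript files.
    """
    entries = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Split on the first colon only, so colons in the text survive.
            utt_id, _, text = line.partition(":")
            entries[utt_id] = text
    return entries
```

Summing `wav_duration_seconds` over every file in a directory would give a quick estimate of the total amount of speech, which can be compared against the stated 10 hours.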

Abstract

With advances in machine learning techniques, including deep learning, free large-scale speech corpora that can be shared between academic institutions and commercial companies play an important role. However, no such corpus exists for Japanese speech synthesis. In this paper, we design a novel Japanese speech corpus, named the "JSUT corpus," aimed at end-to-end speech synthesis. The corpus consists of 10 hours of reading-style speech data with transcriptions and covers all of the main pronunciations of daily-use Japanese characters. We describe how we designed and analyzed the corpus. The corpus is freely available online.
