Frame-Wise Breath Detection with Self-Training: An Exploration of Enhancing Breath Naturalness in Text-to-Speech

Dong Yang, Tomoki Koriyama, Yuki Saito

TL;DR

This work addresses the costly manual annotation of breath positions needed for natural-sounding TTS by introducing a self-training framework for frame-wise breath detection. It combines rule-based annotation with iterative pseudo-labeling and a Conformer-based detector whose down-/up-sampling layers preserve high temporal resolution, aided by features such as the zero-crossing rate (ZCR) and VMS. The approach outperforms a CNN-based baseline on breath detection and, when used to insert breath marks, improves the naturalness of breath in multi-speaker TTS, even enabling breath synthesis for speakers lacking breath data. In practice, the method reduces annotation requirements while making breath in synthetic speech more realistic, with implications for TTS, corpus construction, and related speech technologies.
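
As a rough illustration of one of the features named above, the sketch below computes a frame-wise zero-crossing rate (ZCR) with NumPy. The function name `frame_zcr`, the frame length of 400 samples, and the hop of 160 samples are assumptions chosen for illustration; the summary does not state the paper's actual feature configuration.

```python
import numpy as np

def frame_zcr(signal, frame_len=400, hop=160):
    """Zero-crossing rate per frame.

    For each frame, returns the fraction of adjacent sample pairs whose
    signs differ. frame_len and hop are illustrative assumptions.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop
    rates = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        signs = np.signbit(frame)  # True where the sample is negative
        rates[i] = np.mean(signs[1:] != signs[:-1])
    return rates
```

Noise-like sounds such as breath tend to change sign far more often than voiced speech, which is why ZCR is a common cue for telling the two apart.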

Abstract

Developing Text-to-Speech (TTS) systems that can synthesize natural breath is essential for human-like voice agents but requires extensive manual annotation of breath positions in training data. To this end, we propose a self-training method for training a breath detection model that can automatically detect breath positions in speech. Our method trains the model on a large speech corpus and involves: 1) annotating a limited set of breath sounds with a rule-based approach, and 2) iteratively augmenting these annotations through pseudo-labeling based on the model's predictions. Our detection model employs Conformer blocks with down-/up-sampling layers, enabling accurate frame-wise breath detection. We investigate its effectiveness in multi-speaker TTS using text transcripts with detected breath marks. The results indicate that using our proposed model for breath detection and breath mark insertion yields more natural synthesized speech containing breath than a baseline model.
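
The two-step procedure in the abstract (a small rule-based seed set, then iterative pseudo-labeling) can be sketched generically as below. This is a minimal sketch under stated assumptions, not the authors' implementation: a tiny logistic regression stands in for the Conformer-based detector, and the function names, the confidence thresholds `hi` and `lo`, and the number of rounds are all hypothetical.

```python
import numpy as np

def fit_logreg(X, y, lr=0.5, steps=500):
    """Toy stand-in detector: logistic regression via gradient descent."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

def predict_proba(w, X):
    """Per-frame probability of the 'breath' class."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))

def self_train(X, labels, rounds=3, hi=0.9, lo=0.1):
    """Iterative pseudo-labeling over frame features X.

    labels: 1 = breath, 0 = non-breath, -1 = unlabeled (seed labels would
    come from a rule-based annotator). hi/lo are assumed thresholds.
    """
    labels = np.asarray(labels).copy()
    for _ in range(rounds):
        known = labels >= 0
        unknown = ~known
        if not unknown.any():
            break
        w = fit_logreg(X[known], labels[known].astype(float))
        p = predict_proba(w, X)
        # Only confident predictions on unlabeled frames become pseudo-labels.
        labels[unknown & (p >= hi)] = 1
        labels[unknown & (p <= lo)] = 0
    known = labels >= 0
    w = fit_logreg(X[known], labels[known].astype(float))
    return w, labels
```

Seed labels are never overwritten here, and frames whose predicted probability falls between the two thresholds stay unlabeled until a later round, which limits how quickly early mistakes can propagate.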

Paper Structure (10 sections, 1 equation, 2 figures, 4 tables, 1 algorithm)

This paper contains 10 sections, 1 equation, 2 figures, 4 tables, 1 algorithm.

Figures (2)

  • Figure 1: VMS curve within a pause segment.
  • Figure 2: Architecture of proposed model.