SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech

Minchan Kim; Myeonghun Jeong; Joun Yeop Lee; Nam Soo Kim

TL;DR

SegINR simplifies the TTS process by directly converting text sequences into frame-level features using a conditional implicit neural representation (INR).

Abstract

We present SegINR, a novel approach to neural Text-to-Speech (TTS) that addresses sequence alignment without relying on an auxiliary duration predictor and complex autoregressive (AR) or non-autoregressive (NAR) frame-level sequence modeling. SegINR simplifies the process by converting text sequences directly into frame-level features. It leverages an optimal text encoder to extract embeddings, transforming each into a segment of frame-level features using a conditional implicit neural representation (INR). This method, named segment-wise INR (SegINR), models temporal dynamics within each segment and autonomously defines segment boundaries, reducing computational costs. We integrate SegINR into a two-stage TTS framework, using it for semantic token prediction. Our experiments in zero-shot adaptive TTS scenarios demonstrate that SegINR outperforms conventional methods in speech quality with computational efficiency.
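The segment-wise expansion described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the names `make_toy_inr`, `inr_step`, and `expand` are hypothetical, the weights are random and untrained, the sinusoidal time features and greedy argmax decoding are arbitrary choices, and the end-of-segment class is only a guess at the role of the $\varnothing$ symbol mentioned in the Figure 2 caption. The idea shown is that each text embedding conditions a small network that is queried at frame indices t = 0, 1, 2, ... until it emits an end-of-segment symbol, so segment lengths come from the model itself rather than from a separate duration predictor.

```python
import math
import random


def make_toy_inr(num_tokens, hidden, emb_dim, seed=0):
    """Build random weights for a tiny conditional MLP (untrained, illustration only)."""
    rng = random.Random(seed)
    in_dim = emb_dim + 2  # text embedding plus [sin(t), cos(t)] time features (assumed)
    w1 = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(hidden)]
    # One extra output class acts as the assumed end-of-segment symbol.
    w2 = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(num_tokens + 1)]
    return w1, w2


def inr_step(weights, embedding, t):
    """Query the network at frame index t, conditioned on one text embedding.

    Returns a class index in [0, num_tokens]; index num_tokens means end-of-segment.
    """
    w1, w2 = weights
    x = list(embedding) + [math.sin(t), math.cos(t)]
    h = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w1]
    logits = [sum(w * v for w, v in zip(row, h)) for row in w2]
    return max(range(len(logits)), key=logits.__getitem__)  # greedy choice (assumed)


def expand(weights, embeddings, max_len=8):
    """Turn a sequence of text embeddings into one flat frame-level token sequence.

    Returns (frames, boundaries), where boundaries[i] is the number of frames
    emitted once segment i has finished. max_len is a hypothetical safety cap.
    """
    end_of_segment = len(weights[1]) - 1
    frames, boundaries = [], []
    for emb in embeddings:
        for t in range(max_len):
            y = inr_step(weights, emb, t)
            if y == end_of_segment:
                break  # the model itself decides where this segment stops
            frames.append(y)
        boundaries.append(len(frames))
    return frames, boundaries
```

Calling `expand(make_toy_inr(5, 4, 3), embeddings)` on a list of 3-dimensional embeddings returns the concatenated frame tokens together with the cumulative segment boundaries. With random weights the segment lengths carry no meaning; a trained model would be needed for the boundaries to reflect actual durations.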

Paper Structure (21 sections, 1 equation, 2 figures, 2 tables)

Introduction
Backgrounds
Implicit Neural Representation (INR)
Length Regulation in TTS
Attention-based AR Models (shen2018natural, li2019neural, kharitonov2023speak, wang2023neural)
Transducer (kim2023transduce, chen2021speech, du2024vall)
Duration-based NAR Models (ren2020fastspeech, kim2021conditional, popov2021grad)
Method
Segment-wise Implicit Neural Representation (SegINR)
Application
Semantic Token Prediction
Architecture
Training
Inference
Experiments
...and 6 more sections

Figures (2)

Figure 1: Illustration of SegINR and its application for semantic token prediction: (a) overall concept of SegINR, (b) structure of SegINR, (c) training method for semantic token prediction, (d) inference method for semantic token prediction.
Figure 2: Comparison of the adoption of padded training: (a) and (b) show the probability of $\varnothing$, while (c) and (d) show the probability of $y$ for a fixed $u$.
