SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech

Minchan Kim, Myeonghun Jeong, Joun Yeop Lee, Nam Soo Kim

TL;DR

SegINR simplifies the TTS process by directly converting text sequences into frame-level features using a conditional implicit neural representation (INR).

Abstract

We present SegINR, a novel approach to neural Text-to-Speech (TTS) that addresses sequence alignment without relying on an auxiliary duration predictor and complex autoregressive (AR) or non-autoregressive (NAR) frame-level sequence modeling. SegINR simplifies the process by converting text sequences directly into frame-level features. It leverages an optimal text encoder to extract embeddings, transforming each into a segment of frame-level features using a conditional implicit neural representation (INR). This method, named segment-wise INR (SegINR), models temporal dynamics within each segment and autonomously defines segment boundaries, reducing computational costs. We integrate SegINR into a two-stage TTS framework, using it for semantic token prediction. Our experiments in zero-shot adaptive TTS scenarios demonstrate that SegINR outperforms conventional methods in speech quality with computational efficiency.
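
The segment-wise expansion described in the abstract can be pictured with a small toy sketch. This is not the authors' implementation: `toy_inr`, `expand_segments`, the `(length, base_token)` embedding layout, and the `max_frames` cap are all hypothetical names and choices made for illustration, and treating $\varnothing$ (see Figure 2) as an end-of-segment symbol is an assumption about the notation. The idea shown is only that each text embedding is queried at successive frame indices until the end symbol appears, so no separate duration predictor is needed.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical stand-in for the end-of-segment symbol (assumed meaning of the
# empty-set symbol in Figure 2).
END = None

Embedding = Tuple[int, int]  # toy layout: (segment length, base token id)


def toy_inr(embedding: Embedding, t: int) -> Optional[int]:
    """Toy conditional INR: maps (embedding, frame index t) to a token or END.

    A real model would be a neural network; here the embedding simply
    encodes how long the segment is and which token it starts from.
    """
    length, base_token = embedding
    if t >= length:
        return END
    return base_token + t


def expand_segments(
    embeddings: Sequence[Embedding],
    inr: Callable[[Embedding, int], Optional[int]],
    max_frames: int = 50,
) -> List[int]:
    """Expand each text embedding into its own segment of frame-level tokens."""
    frames: List[int] = []
    for emb in embeddings:
        # Query the INR at t = 0, 1, 2, ... until it signals the segment end.
        for t in range(max_frames):
            y = inr(emb, t)
            if y is END:
                break
            frames.append(y)
    return frames


if __name__ == "__main__":
    # Three text units with lengths 2, 0 and 3 frames.
    print(expand_segments([(2, 10), (0, 99), (3, 20)], toy_inr))
```

Because every segment depends only on its own embedding, the inner loops are independent of each other in this sketch and could be evaluated in parallel; the `max_frames` cap is a safeguard added here in case the end symbol is never produced.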

Paper Structure (21 sections, 1 equation, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Illustration of SegINR and its application for semantic token prediction: (a) overall concept of SegINR, (b) structure of SegINR, (c) training method for semantic token prediction, (d) inference method for semantic token prediction.
  • Figure 2: Comparison of the adoption of padded training: (a) and (b) show the probability of $\varnothing$, while (c) and (d) show the probability of $y$ for a fixed $u$.