Multi-Task Learning for Front-End Text Processing in TTS

Wonjune Kang, Yun Wang, Shun Zhang, Arthur Hinsvark, Qing He

TL;DR

The paper proposes a unified multi-task learning framework for three text-to-speech (TTS) front-end tasks: text normalization (TN), part-of-speech (POS) tagging, and homograph disambiguation (HD). The model uses a two-stream trunk that fuses token-level TN features with ALBERT embeddings via cross-attention. Training on all three tasks yields the strongest overall performance, which provides empirical evidence of positive transfer among tasks. A balanced HD dataset generated with Llama 2 is introduced to address the imbalance in pronunciation data; adding it to training yields substantial gains on the HD task. The work offers practical insights for building more cohesive and robust TTS front-ends and highlights data balancing as a key factor in HD performance.

Abstract

We propose a multi-task learning (MTL) model for jointly performing three tasks that are commonly solved in a text-to-speech (TTS) front-end: text normalization (TN), part-of-speech (POS) tagging, and homograph disambiguation (HD). Our framework utilizes a tree-like structure with a trunk that learns shared representations, followed by separate task-specific heads. We further incorporate a pre-trained language model to utilize its built-in lexical and contextual knowledge, and study how to best use its embeddings so as to most effectively benefit our multi-task model. Through task-wise ablations, we show that our full model trained on all three tasks achieves the strongest overall performance compared to models trained on individual or sub-combinations of tasks, confirming the advantages of our MTL framework. Finally, we introduce a new HD dataset containing a balanced number of sentences in diverse contexts for a variety of homographs and their pronunciations. We demonstrate that incorporating this dataset into training significantly improves HD performance over only using a commonly used, but imbalanced, pre-existing dataset.
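
To make the notion of a "balanced" HD dataset concrete, the following is a minimal sketch of how one might measure pronunciation imbalance per homograph. It is not taken from the paper: the function names, the `(homograph, pronunciation, sentence)` tuple layout, the pronunciation labels, and the example sentences are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def pronunciation_counts(examples):
    """Count sentences per pronunciation for every homograph.

    `examples` is an iterable of (homograph, pronunciation, sentence)
    tuples; this layout is an assumption, not the paper's data format.
    """
    counts = defaultdict(Counter)
    for homograph, pronunciation, _sentence in examples:
        counts[homograph][pronunciation] += 1
    return counts


def imbalance_ratio(counter):
    """Ratio of the most to the least frequent observed pronunciation.

    A value of 1.0 means the observed pronunciations are perfectly balanced.
    Pronunciations with zero examples are invisible to this measure.
    """
    return max(counter.values()) / min(counter.values())


# Hypothetical toy data: "bass" is skewed, "lead" is balanced.
examples = [
    ("bass", "fish", "He caught a bass in the lake."),
    ("bass", "music", "She plays bass in a band."),
    ("bass", "music", "The bass line is too loud."),
    ("bass", "music", "Turn up the bass."),
    ("lead", "metal", "The pipe is made of lead."),
    ("lead", "verb", "She will lead the team."),
]

counts = pronunciation_counts(examples)
ratios = {word: imbalance_ratio(c) for word, c in counts.items()}
```

A dataset in which every homograph has a ratio close to 1.0 is balanced in the sense the abstract describes; a large ratio marks a homograph whose rarer pronunciation is underrepresented.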

Paper Structure

This paper contains 15 sections, 1 figure, and 2 tables.

Figures (1)

  • Figure 1: Block diagram of the proposed multi-task model for TN, POS tagging, and HD. The shared trunk processes the input text in two streams, which are combined using cross-attention. The shared representations are then passed to separate heads that solve each task.
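
The data flow in the caption (two streams, cross-attention fusion, separate heads) can be sketched in a few lines of plain Python. This is an illustrative toy under several assumptions, not the authors' implementation: the token stream is used as the queries and the language-model stream as keys and values, the fusion is single-head with a residual sum, the heads are single linear maps, and all dimensions, class counts, and random weights are made up.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention across two sequences."""
    d = len(queries[0])
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values), weights


def random_matrix(rows, cols, rng):
    """Stand-in for learned features or weights (random, for illustration)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 8
token_stream = random_matrix(5, dim, rng)  # assumed: 5 token-level features
lm_stream = random_matrix(7, dim, rng)     # assumed: 7 language-model embeddings

# Assumption: token features attend over the language-model embeddings.
fused, weights = cross_attention(token_stream, lm_stream, lm_stream)

# Assumed residual combination to form the shared representation.
shared = [[t + f for t, f in zip(t_row, f_row)]
          for t_row, f_row in zip(token_stream, fused)]

# One hypothetical linear head per task; class counts are arbitrary.
heads = {
    "tn": random_matrix(dim, 4, rng),
    "pos": random_matrix(dim, 6, rng),
    "hd": random_matrix(dim, 2, rng),
}
logits = {task: matmul(shared, w) for task, w in heads.items()}
```

The point of the sketch is the shape bookkeeping: the fused output keeps one row per token regardless of the length of the second stream, so each task head can emit one prediction per token from the same shared representation.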