Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech

Rikuto Kotoge, Yuichi Sasaki

TL;DR

This work tackles two core challenges in LLM-based TTS: the need for paired, utterance-level data and the mismatch between utterance-level optimization and token-level pronunciation units. It introduces TKTO, a two-step, data-efficient framework. Step 1 estimates token-level importance using KTO-based contrastive LLMs trained on unpaired data. Step 2 optimizes token-level preferences via a Kahneman-Tversky-inspired objective with per-token rewards and a logistic value function. The method yields substantial improvements in Japanese TTS: about a 39% gain in pronunciation accuracy and a 54% reduction in character error rate (CER), with targeted tokens receiving up to 12.8× stronger rewards. TKTO demonstrates that token-level signals can be learned without token-level annotations or paired data, offering a practical path toward fine-grained pronunciation control and potential applicability to other token-critical generation tasks.

Abstract

Aligning text-to-speech (TTS) system outputs with human feedback through preference optimization has been shown to effectively improve the robustness and naturalness of language model-based TTS models. Current approaches primarily require paired desirable and undesirable samples at the utterance level. However, such pairs are often scarce in TTS output data, and the utterance-level formulation prevents the fine-grained token-level optimization needed for accurate pronunciation alignment. In this study, we propose TKTO, which eliminates the need for paired data, enabling a more data-efficient training paradigm, and directly targets token-level units, automatically providing fine-grained alignment signals without token-level annotations. On the challenging task of Japanese TTS, TKTO improves accuracy by 39% and reduces CER by 54%, automatically assigning 12.8 times stronger reward to targeted tokens.
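
The two-step idea described above (token-level importance estimated from two contrastive models, followed by a Kahneman-Tversky-style objective over per-token rewards with a logistic value function) can be illustrated with a minimal Python sketch. This is not the paper's exact formulation: the weight rule (a clipped exponential of the log-probability gap), the fixed reference point `z0`, and all names and default values (`token_weights`, `token_kto_loss`, `mu`, `beta`, `lo`, `hi`) are assumptions chosen for illustration.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def token_weights(logp_pos: List[float], logp_neg: List[float],
                  desirable: bool, mu: float = 1.0,
                  lo: float = 0.5, hi: float = 2.0) -> List[float]:
    """Step 1 (assumed form): per-token importance from two contrastive models.

    logp_pos / logp_neg are per-token log-probabilities under a model tuned
    toward desirable outputs and one tuned toward undesirable outputs.
    Tokens on which the models disagree get weights far from 1; the result
    is clipped to [lo, hi] so no single token dominates.
    """
    sign = 1.0 if desirable else -1.0
    return [min(hi, max(lo, math.exp(sign * mu * (p - n))))
            for p, n in zip(logp_pos, logp_neg)]


def token_kto_loss(logp_policy: List[float], logp_ref: List[float],
                   weights: List[float], desirable: bool,
                   beta: float = 0.1, z0: float = 0.0) -> float:
    """Step 2 (assumed form): weighted token-level loss for one unpaired sample.

    Each token's implicit reward is beta * (log pi - log pi_ref). A logistic
    value function compares it with the reference point z0 (held constant
    here as a simplification). The loss is the weighted mean of (1 - value).
    """
    total = 0.0
    for lp, lr, w in zip(logp_policy, logp_ref, weights):
        reward = beta * (lp - lr)
        value = sigmoid(reward - z0) if desirable else sigmoid(z0 - reward)
        total += w * (1.0 - value)
    return total / sum(weights)


# Hypothetical per-token log-probabilities for one desirable utterance.
w = token_weights([-0.1, -2.0], [-0.1, -0.5], desirable=True)
loss = token_kto_loss([-0.5, -0.5], [-1.0, -1.0], w, desirable=True)
print(w, loss)
```

Because each sample carries only a desirable or undesirable label, the loss is computed per sample and no paired utterance is needed. The weights from step 1 simply rescale how much each token contributes.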

Paper Structure

This paper contains 26 sections, 5 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Examples of ambiguity in Japanese. Although (a) and (b) contain the same word, its meaning and reading differ depending on the context.
  • Figure 2: Overview of our TKTO framework. Step 1: we estimate token-level importance weights, constructing two contrastive LLMs. Step 2: we optimize token-level preferences.
  • Figure 3: Average log-likelihood for desirable and undesirable tokens during training. TKTO effectively increases only that of desirable tokens.
  • Figure 4: Token reward analysis. Desirable tokens have a higher reward, while undesirable tokens have a much lower reward, encouraging both weights to increase.
  • Figure 5: Case study of token weight estimation. Tokens of the target character have higher weights.
  • ...and 1 more figure