Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech
Rikuto Kotoge, Yuichi Sasaki
TL;DR
This work tackles two core challenges in LLM-based TTS: the reliance on paired, utterance-level preference data, and the mismatch between utterance-level optimization and the token-level units that determine pronunciation. It introduces TKTO, a two-step, data-efficient framework that first estimates token-level importance using KTO-based contrastive LLMs trained on unpaired data, and then optimizes token-level preferences via a Kahneman-Tversky-inspired objective with per-token rewards and a logistic value function. On Japanese TTS, the method improves pronunciation accuracy by about 39% and reduces CER by 54%, with targeted tokens receiving up to 12.8× stronger rewards. TKTO demonstrates that token-level signals can be learned without token-level annotations or paired data, offering a practical path toward fine-grained pronunciation control and potential applicability to other token-critical generation tasks.
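The two steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary gives no formulas, so the weighting rule (clipped absolute log-probability gap between two contrastive models), the function names `token_importance` and `token_level_loss`, and the parameters `beta`, `z_ref`, and `max_weight` are all assumptions chosen for illustration. It only shows how per-token rewards, importance weights, and a logistic value function might fit together.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, used here as the value function."""
    return 1.0 / (1.0 + math.exp(-x))


def token_importance(logp_desirable, logp_undesirable, max_weight=5.0):
    """Step 1 (assumed form): weight each token by how strongly two
    contrastive models disagree on it, clipped to max_weight.

    logp_desirable / logp_undesirable are per-token log-probabilities from
    hypothetical models tuned on desirable and undesirable samples.
    """
    return [
        min(abs(d - u), max_weight)
        for d, u in zip(logp_desirable, logp_undesirable)
    ]


def token_level_loss(logp_policy, logp_ref, weights, desirable,
                     beta=1.0, z_ref=0.0):
    """Step 2 (assumed form): KTO-style loss applied per token.

    The per-token reward is the importance-weighted log-ratio between the
    policy and a frozen reference model. Desirable samples are pushed above
    the reference point z_ref, undesirable ones below it.
    """
    losses = []
    for lp, lr, w in zip(logp_policy, logp_ref, weights):
        reward = beta * w * (lp - lr)
        if desirable:
            value = sigmoid(reward - z_ref)
        else:
            value = sigmoid(z_ref - reward)
        losses.append(1.0 - value)
    return sum(losses) / len(losses)


# Illustrative inputs only: the second token has a large contrastive gap,
# so it receives a larger weight and dominates the loss.
weights = token_importance([-0.1, -0.3], [-0.2, -2.3])
loss = token_level_loss([-0.1, -0.2], [-0.1, -0.9], weights, desirable=True)
```

Because each sample carries only a desirable or undesirable label, a loss of this shape needs no paired utterances, which is the data-efficiency argument made above.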
Abstract
Aligning text-to-speech (TTS) system outputs with human feedback through preference optimization has been shown to effectively improve the robustness and naturalness of language model-based TTS models. Current approaches primarily require paired desirable and undesirable samples at the utterance level. However, such pairs are often scarce in TTS output data, and the utterance-level formulation prevents the fine-grained, token-level optimization needed for accurate pronunciation alignment. In this study, we propose TKTO, which eliminates the need for paired data, enabling a more data-efficient training paradigm, and directly targets token-level units, automatically providing fine-grained alignment signals without token-level annotations. On a challenging Japanese TTS task, TKTO improves accuracy by 39% and reduces the character error rate (CER) by 54%, while automatically assigning 12.8 times stronger reward to targeted tokens.
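For readers unfamiliar with the metric, CER is conventionally computed as the character-level edit distance between a reference transcript and a hypothesis, divided by the reference length. The sketch below uses that standard definition; the abstract does not describe the evaluation pipeline, so the helper names and the example strings are illustrative assumptions. It also assumes the 54% figure is a relative reduction, which the text does not state explicitly, and the numbers passed to `relative_reduction` are made up rather than taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from a baseline value to an improved value."""
    return (before - after) / before


# Two of five characters differ, so the CER is 2 / 5.
print(cer("こんにちは", "こんばんは"))  # 0.4

# Hypothetical CER values, chosen only to show what a 54% relative
# reduction looks like; they are not results from the paper.
print(round(relative_reduction(0.50, 0.23), 2))  # 0.54
```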
