A Language Modeling Approach to Diacritic-Free Hebrew TTS
Amit Roth, Arnon Turetzky, Yossi Adi
TL;DR
Hebrew TTS faces pronunciation ambiguity because modern Hebrew text is usually written without diacritics. The authors propose a language model that operates on discrete speech tokens and word-piece text tokens, conditioned on acoustic prompts, to synthesize Hebrew speech directly from diacritic-free text. They train on about 4500 hours of weakly labeled data from ivrit.ai and HebDB, and report improvements over the diacritic-predicting baselines MMS and Overflow in both objective metrics and human judgments, with word-piece tokenization outperforming character-level tokenization. The approach demonstrates scalable, context-aware TTS for Hebrew with strong cross-speaker generalization, though autoregressive latency and occasional word omissions remain limitations. Code and datasets are publicly available to facilitate further research.
Abstract
We tackle the task of text-to-speech (TTS) in Hebrew. Traditional Hebrew contains diacritics, which dictate how individuals should pronounce given words; modern Hebrew, however, rarely uses them. In the absence of diacritics, readers are expected to infer the correct pronunciation, and which phonemes to use, from the context. This poses a fundamental challenge for TTS systems, which must accurately map text to speech. In this work, we propose a diacritic-free language modeling approach to Hebrew TTS. The model operates on discrete speech representations and is conditioned on text processed by a word-piece tokenizer. We optimize the proposed method using in-the-wild, weakly supervised data and compare it to several diacritic-based TTS systems. Results suggest the proposed method is superior to the evaluated baselines in both content preservation and naturalness of the generated speech. Samples can be found under the following link: pages.cs.huji.ac.il/adiyoss-lab/HebTTS/
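To make the tokenization contrast concrete, the following is a minimal sketch of greedy longest-match word-piece segmentation next to character-level splitting. It is an illustration only, not the authors' tokenizer: the function names, the `##` continuation prefix, and the transliterated toy vocabulary are all assumptions introduced here.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word.

    Pieces that continue a word carry a "##" prefix (an assumed convention).
    If no vocabulary entry matches at some position, the whole word maps to `unk`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


def char_tokenize(word):
    """Character-level baseline: one token per character."""
    return list(word)


# Hypothetical toy vocabulary over a transliterated word, for illustration only.
TOY_VOCAB = {"sh", "##alom"}

word_pieces = wordpiece_tokenize("shalom", TOY_VOCAB)
characters = char_tokenize("shalom")
unknown = wordpiece_tokenize("xyz", TOY_VOCAB)
```

In this sketch the word-piece segmentation yields fewer, longer units than the character-level split, so each text token covers more context. That is one plausible reading of why a word-piece tokenizer might help a model resolve pronunciation from context, though the sketch does not demonstrate that effect.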
