T5Gemma-TTS Technical Report

Chihiro Arata, Kiyoshi Kurihara

Abstract

Autoregressive neural codec language models have shown strong zero-shot voice cloning ability, but decoder-only architectures treat input text as a prefix that competes with the growing audio sequence for positional capacity, weakening text conditioning over long utterances. We present T5Gemma-TTS, an encoder-decoder codec language model that maintains persistent text conditioning by routing bidirectional text representations through cross-attention at every decoder layer. Built on the T5Gemma pretrained encoder-decoder backbone (2B encoder + 2B decoder; 4B parameters), it inherits rich linguistic knowledge without phoneme conversion and processes text directly at the subword level. To improve duration control, we introduce Progress-Monitoring Rotary Position Embedding (PM-RoPE) in all 26 cross-attention layers, injecting normalized progress signals that help the decoder track target speech length. Trained on 170,000 hours of multilingual speech in English, Chinese, and Japanese, T5Gemma-TTS achieves a statistically significant speaker-similarity gain on Japanese over XTTSv2 (0.677 vs. 0.622; non-overlapping 95% confidence intervals) and the highest numerical Korean speaker similarity (0.747) despite Korean not being included in training, although this margin over XTTSv2 (0.741) is not statistically conclusive. It also attains the lowest numerical Japanese character error rate among five baselines (0.126), though this ranking should be interpreted cautiously because of partial confidence-interval overlap with Kokoro. English results on LibriSpeech should be viewed as an upper-bound estimate because LibriHeavy is a superset of LibriSpeech. Using the same checkpoint, disabling PM-RoPE at inference causes near-complete synthesis failure: CER degrades from 0.129 to 0.982 and duration accuracy drops from 79% to 46%. Code and weights are available at https://github.com/Aratako/T5Gemma-TTS.
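The progress-signal idea stated in the abstract can be illustrated with a short sketch. The PyTorch snippet below assumes that PM-RoPE maps cross-attention key positions (text subwords) and query positions (decoder codec steps) onto a shared, normalized progress axis before applying standard rotary embeddings; the function names, the scaling constant, and this exact formulation are assumptions for illustration, not the released implementation.

```python
import torch


def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotary angles for (possibly fractional) positions; returns (..., dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return positions[..., None] * inv_freq


def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Standard RoPE: rotate consecutive channel pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out


def pm_rope_positions(num_text_tokens: int, num_audio_steps: int, scale: float = 100.0):
    """Hypothetical progress-monitoring positions: text keys and audio queries are
    both normalized to [0, 1] and rescaled, so text token i and codec step j land
    on nearby rotary phases whenever i/N is close to j/T."""
    key_pos = torch.arange(num_text_tokens, dtype=torch.float32) / max(num_text_tokens - 1, 1)
    query_pos = torch.arange(num_audio_steps, dtype=torch.float32) / max(num_audio_steps - 1, 1)
    return query_pos * scale, key_pos * scale


# Toy cross-attention with PM-RoPE-style positions (single head, no projections).
dim = 64
q = torch.randn(200, dim)   # decoder queries, one per generated codec frame
k = torch.randn(30, dim)    # encoder keys, one per text subword
q_pos, k_pos = pm_rope_positions(num_text_tokens=30, num_audio_steps=200)
scores = apply_rope(q, rope_angles(q_pos, dim)) @ apply_rope(k, rope_angles(k_pos, dim)).T
attn = torch.softmax(scores / dim ** 0.5, dim=-1)   # (200, 30) text-audio alignment
```

Note that computing the query-side progress in this sketch requires a target length at inference time, which is consistent with the report's framing of PM-RoPE as a duration-control mechanism.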

Paper Structure

This paper contains 40 sections, 5 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Overall architecture of T5Gemma-TTS. The T5Gemma encoder processes input text bidirectionally and produces contextualized representations, which are injected into every decoder layer via PM-RoPE cross-attention. The decoder autoregressively generates XCodec2 audio tokens conditioned on both the encoder output and a reference speech prompt.
  • Figure 2: Intelligibility (CER/WER) across all six test sets (JSUT/JA, AISHELL-1/ZH, LibriSpeech/EN$^\dagger$, FLEURS/KO, FLEURS/FR, FLEURS/DE) and five systems. CER is used for JA/ZH/KO; WER for EN/FR/DE; lower is better. F5-TTS shows near-complete intelligibility failure on Japanese (CER > 1.0). $^\dagger$T5Gemma-TTS EN results are upper-bound estimates (LibriHeavy training/test overlap).
  • Figure 3: SIM across all six test sets and five systems. Higher is better. T5Gemma-TTS achieves the highest SIM on Japanese (statistically supported; CI non-overlapping with XTTS v2) and numerically highest on Korean (CI overlapping with XTTS v2; not conclusive). $\dagger$ Kokoro SIM reflects a preset voice, not the reference speaker.
  • Figure 4: Heatmap of SIM (left) and UTMOS (right) across all five systems and six languages. Darker shading indicates better performance. Kokoro achieves high UTMOS but near-zero SIM (no voice cloning capability). T5Gemma-TTS shows the strongest SIM on Japanese and Korean among zero-shot systems.
  • Figure 5: Radar chart of normalized multi-metric averages (6 languages). Intelligibility $= 1 - \mathrm{CER/WER}$ (capped at 0 for values $>1$); UTMOS normalized to $[0,1]$ via $(x-1)/4$. Caveat: the normalization compresses CER and WER into the same $[0,1]$ scale across languages with different phonetic and orthographic properties; cross-lingual comparisons within this chart should be treated as qualitative trends rather than precise numerical rankings. (A minimal worked example of this normalization follows the figure list.)
  • ...and 1 more figure
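As a concrete reading of the Figure 5 normalization, the two mappings can be written as one-line helpers. The snippet below only restates the formulas given in the caption; the function names are illustrative, not taken from the released code.

```python
def normalized_intelligibility(error_rate: float) -> float:
    """Intelligibility = 1 - CER (or WER), floored at 0 when the error rate exceeds 1."""
    return max(0.0, 1.0 - error_rate)


def normalized_utmos(utmos: float) -> float:
    """UTMOS lies on a 1-5 MOS scale; (x - 1) / 4 maps it onto [0, 1]."""
    return (utmos - 1.0) / 4.0


# Example: the reported Japanese CER of 0.126 maps to 0.874, while any error rate
# above 1.0 (e.g., F5-TTS on Japanese) is clamped to 0.
print(normalized_intelligibility(0.126))  # 0.874
print(normalized_intelligibility(1.2))    # 0.0
```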