GLM-TTS Technical Report

Jiayan Cui; Zhihan Yang; Naihan Li; Jiankun Tian; Xingyu Ma; Yi Zhang; Guangyu Chen; Runxuan Yang; Yuqing Cheng; Yizhi Zhou; Guochen Yu; Xiaotao Gu; Jie Tang

TL;DR

GLM-TTS introduces a production-oriented, two-stage TTS framework that combines a text-to-token autoregressive model with a token-to-waveform diffusion model, achieving strong results with only 100k hours of data. The approach integrates an optimized Whisper-VQ speech tokenizer, GRPO-based multi-reward reinforcement learning, low-cost LoRA-based voice customization, and a hybrid phoneme–text input plus the Vocos2D vocoder to balance accuracy, expressiveness, and deployability. Experimental results on Seed-TTS-eval and internal benchmarks show competitive pronunciation, timbre fidelity, and emotion expressiveness, with ablations confirming the effectiveness of Phoneme-in and Vocos2D. The work emphasizes practical deployment considerations, providing code and demos to facilitate production-ready speech synthesis and customization at scale.

Abstract

This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is available at https://github.com/zai-org/GLM-TTS. Real-time speech synthesis demos are provided via Z.ai (audio.z.ai) and the Zhipu Qingyan app/web (chatglm.cn).
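
To make the multi-reward GRPO idea concrete, the following is a minimal sketch of how several reward scores might be merged and turned into group-relative advantages. It is not the authors' implementation: the weighted-sum combination, the weight values, the reward names, the use of the population standard deviation, and the function names `combined_reward` and `group_relative_advantages` are all assumptions made for illustration. Only the general GRPO step of standardizing rewards within a group of samples generated for the same prompt is taken as given.

```python
from statistics import mean, pstdev

# Hypothetical weights: the abstract does not say how the rewards are combined.
WEIGHTS = {"pronunciation": 1.0, "similarity": 1.0, "prosody": 1.0}


def combined_reward(scores, weights=WEIGHTS):
    """Weighted sum of the per-aspect reward scores for one sampled utterance."""
    return sum(weights[name] * scores[name] for name in weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical samples generated for the same text prompt.
group = [
    {"pronunciation": 0.9, "similarity": 0.7, "prosody": 0.6},
    {"pronunciation": 0.5, "similarity": 0.8, "prosody": 0.4},
    {"pronunciation": 0.7, "similarity": 0.6, "prosody": 0.9},
]
rewards = [combined_reward(scores) for scores in group]
advantages = group_relative_advantages(rewards)
```

Samples whose combined reward is above the group mean receive a positive advantage and are reinforced; those below the mean receive a negative one. Because the baseline is the group mean, no separate value network is needed in this scheme.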
