Scaling Speech Tokenizers with Diffusion Autoencoders

Yuancheng Wang; Zhenyu Tang; Yun Wang; Arthur Hinsvark; Yingru Liu; Yinghao Li; Kainan Peng; Junyi Ao; Mingbo Ma; Mike Seltzer; Qing He; Xubo Liu

TL;DR

SiTok is a diffusion-based, end-to-end speech tokenizer that addresses three goals at once: compressing speech signals, reconstructing high-fidelity audio, and preserving semantic content for speech understanding. It combines a diffusion decoder with a vector-quantized latent space and adds semantic regularization through a CTC loss, which enables an extremely low token rate ($12.5$ Hz) and bitrate ($0.2$ kbps) while scaling to $1.6$B parameters and $2$ million hours of training data. The method achieves competitive reconstruction quality and state-of-the-art results on understanding tasks, and the learned tokens support zero-shot speech generation with efficient autoregressive TTS. In practice, SiTok provides a unified representation that improves scalability for speech language models and, because its token sequences are compact, enables faster generation.

Abstract

Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing the trade-off between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantic-rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on understanding, reconstruction, and generation tasks at an extremely low token rate of $12.5$ Hz and a bit rate of 200 bits per second.
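The token rate and bit rate quoted above can be related with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the implied codebook size holds only under the assumption that each frame emits exactly one token from a single codebook, which this summary does not state.

```python
import math


def bits_per_token(bitrate_bps: float, token_rate_hz: float) -> float:
    """Average number of bits carried by each token."""
    return bitrate_bps / token_rate_hz


def implied_codebook_size(bits: float) -> int:
    """Codebook size if one token per frame comes from a single codebook.

    This single-codebook layout is an assumption for illustration,
    not something confirmed by the text above.
    """
    return round(2 ** bits)


def num_tokens(duration_s: float, token_rate_hz: float) -> int:
    """Number of tokens needed to cover an utterance of the given length."""
    return math.ceil(duration_s * token_rate_hz)


# Figures taken from the abstract: 200 bits per second at 12.5 Hz.
bits = bits_per_token(200.0, 12.5)
size = implied_codebook_size(bits)
tokens_for_10s = num_tokens(10.0, 12.5)

print(f"{bits:.1f} bits per token")
print(f"implied codebook size (single-codebook assumption): {size}")
print(f"tokens for a 10 s utterance: {tokens_for_10s}")
```

At 12.5 Hz, a 10-second utterance becomes a sequence of only 125 tokens, which is the kind of compact sequence the TL;DR credits for faster autoregressive generation.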

Paper Structure (29 sections, 5 equations, 2 figures, 8 tables)

This paper contains 29 sections, 5 equations, 2 figures, 8 tables.

Introduction
Method
Overview
Semantic Regularization
Efficient Decoding
Reconstruction Refinement
Experiments and Results
Implementation Details
Evaluation
Results and Comparison
Main Results for Reconstruction
Downstream Understanding
Effectiveness of Semantic Regularization
Effectiveness of Model Scaling
Efficient Decoding
...and 14 more sections

Figures (2)

Figure 1: Overview of SiTok.
Figure 2: Impact of shortcut fine-tuning on different inference steps. We report WER, SIM, and UTMOS. Shortcut fine-tuning achieves consistently better intelligibility and similarity, especially at small step numbers.
