MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models

Yitian Gong; Kuangwei Chen; Zhaoye Fei; Xiaogui Yang; Ke Chen; Yang Wang; Kexin Huang; Mingshu Chen; Ruixiao Li; Qingyuan Cheng; Shimin Li; Xipeng Qiu

TL;DR

This work introduces CAT, a fully end-to-end, homogeneous Transformer-based architecture for discrete audio tokenization, and MOSS-Audio-Tokenizer, a 1.6B-parameter tokenizer built on CAT and trained on 3 million hours of diverse audio. CAT jointly optimizes the encoder, residual vector quantizer (RVQ), decoder, causal language model, and discriminator, and achieves high-fidelity reconstruction across speech, sound, and music while supporting streaming and autoregressive generation at variable bitrates. The authors also present a fully autoregressive TTS system (CAT-TTS) that outperforms prior non-autoregressive and cascaded approaches, and they report competitive ASR performance without auxiliary encoders. These results position CAT as a scalable, unified interface for future native audio foundation models, covering audio compression, understanding, and generation.

Abstract

Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures. These designs introduce fixed inductive biases that limit reconstruction fidelity and hinder effective scaling. In this paper, we argue that discrete audio tokenization should be learned fully end-to-end using a homogeneous and scalable architecture. To this end, we first propose CAT (Causal Audio Tokenizer with Transformer), a purely Transformer-based architecture that jointly optimizes the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction. Building on the CAT architecture, we develop MOSS-Audio-Tokenizer, a large-scale audio tokenizer featuring 1.6 billion parameters, pre-trained on 3 million hours of diverse, general audio data. We show that this simple, fully end-to-end approach built from homogeneous, causal Transformer blocks scales gracefully and supports high-fidelity reconstruction across diverse audio domains. Across speech, sound, and music, MOSS-Audio-Tokenizer consistently outperforms prior codecs over a wide range of bitrates, while exhibiting predictable improvements with increased scale. Notably, leveraging the discrete tokens from our model, we develop the first purely autoregressive TTS model that surpasses prior non-autoregressive and cascaded systems. Furthermore, MOSS-Audio-Tokenizer enables competitive ASR performance without auxiliary encoders. Our findings position the CAT architecture as a unified, scalable interface for the next generation of native audio foundation models.
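The quantizer mentioned above is a residual vector quantizer (RVQ), in which each layer encodes the residual left by the layers before it, so decoding from a prefix of the layers yields a lower-bitrate reconstruction. The following is a minimal pure-Python sketch of that general technique, not the authors' implementation: the function names, the random toy codebooks (each with an all-zero entry so that adding a layer never increases the error), and the frame rate and codebook size in the bitrate helper are all illustrative assumptions rather than values from the paper.

```python
import math
import random


def make_toy_codebooks(n_layers, size, dim, seed=0):
    """Random toy codebooks (hypothetical; real codebooks are learned).

    Each layer includes an all-zero entry and uses a smaller scale than the
    previous one, mimicking coarse-to-fine refinement.
    """
    rng = random.Random(seed)
    codebooks = []
    for layer in range(n_layers):
        scale = 0.5 ** layer
        entries = [[0.0] * dim]
        entries += [[rng.gauss(0.0, scale) for _ in range(dim)]
                    for _ in range(size - 1)]
        codebooks.append(entries)
    return codebooks


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2
                                 for c, v in zip(codebook[i], vec)))


def rvq_encode(vec, codebooks, n_layers=None):
    """Greedy RVQ: each layer quantizes the residual of the previous layers."""
    layers = codebooks if n_layers is None else codebooks[:n_layers]
    residual = list(vec)
    codes = []
    for codebook in layers:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Sum the selected entries; fewer codes give a coarser reconstruction."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def sq_error(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def bitrate_bps(frame_rate_hz, n_layers, codebook_size):
    """Bits per second = frames/s * layers * bits per code index."""
    return frame_rate_hz * n_layers * math.log2(codebook_size)


if __name__ == "__main__":
    books = make_toy_codebooks(n_layers=4, size=16, dim=8, seed=0)
    rng = random.Random(1)
    frame = [rng.gauss(0.0, 1.0) for _ in range(8)]
    codes = rvq_encode(frame, books)
    for k in range(1, len(codes) + 1):
        err = sq_error(frame, rvq_decode(codes[:k], books))
        print(f"layers={k} squared_error={err:.4f}")
```

Because the encoding is greedy, the codes for the first k layers are a prefix of the codes for all layers, which is what allows a single token stream to be truncated to serve several bitrates.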

Paper Structure (65 sections, 9 equations, 8 figures, 4 tables)

Introduction
Rethinking Discrete Audio Tokenization for Future Audio Foundation Models
Unified Audio Representation.
Simplicity and Scalability.
Causality.
Low Frame Rate and Bitrate Robustness.
Causal Audio Tokenizer with Transformer (CAT)
Homogeneous Architecture for Scalable Audio Tokenization
Fully Transformer-based encoder-decoder.
Scalable residual vector quantization.
Unified Audio Modeling
Semantic Modeling via Audio-to-Text Tasks.
Quantizer Optimization.
Acoustic Modeling via Reconstruction Tasks.
Adversarial Training.
...and 50 more sections

Figures (8)

Figure 1: Audio reconstruction quality comparison.
Figure 2: Architecture of CAT (Causal Audio Tokenizer with Transformer). Both the encoder and decoder are built upon causal Transformers. All components, including the encoder, quantizer, decoder, causal language model, and discriminator, are optimized jointly in an end-to-end manner.
Figure 3: Effect of Progressive Sequence Dropout on fully autoregressive TTS across different bitrates.
Figure 4: Comparison between full end-to-end optimization and partial (stage-wise) optimization for CAT.
Figure 5: Scaling behavior of CAT reconstruction performance with respect to bitrate and model parameters.
...and 3 more figures
