How Should We Extract Discrete Audio Tokens from Self-Supervised Models?
Pooneh Mousavi, Jarod Duret, Salah Zaiem, Luca Della Libera, Artem Ploujnikov, Cem Subakan, Mirco Ravanelli
TL;DR
This work studies how to extract effective discrete audio tokens from self-supervised models and introduces a four-part architecture: a Tokenizer that quantizes multiple SSL layers with $K$-means; an Informed Layer Selector that fuses layer information using per-time-step attention over layers (via $z_{l,t}$ and $a_{l,t}$); an Acoustic Model that uses the fused representations for discriminative and generative tasks; and a Scalable Vocoder trained with layer dropout so that it can decode arbitrary layer combinations. Together, the scalable vocoder and the attention-based layer fusion enable robust use of multi-layer tokens, outperform vocoders trained on single layers, and provide interpretable layer importance across tasks. Results on automatic speech recognition (ASR), speaker identification (SID), emotion recognition (ER), speech enhancement (SE), and text-to-speech (TTS) show that layer usage is task-dependent: lower layers are often critical for reconstruction and higher layers for semantic content. The number of clusters, the embedding initialization, and the domain of the training data also affect performance differently from task to task. The findings support discrete semantic tokens as a viable bridge between audio and language processing, demonstrate practical gains for multi-task audio modeling, and point to potential integration into multimodal LLMs. Future work includes expanding the set of tasks and multi-speaker vocoding.
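The attention-based layer fusion described above can be illustrated with a minimal NumPy sketch. It rests on assumptions that are not confirmed by this summary: $z_{l,t}$ is taken to be a scalar score for layer $l$ at time step $t$, $a_{l,t}$ is taken to be the softmax of those scores over layers, and the scoring function is a single hypothetical vector `w` standing in for whatever learned scorer the authors use. The function name `fuse_layers` is likewise illustrative.

```python
import numpy as np

def fuse_layers(emb, w):
    """Fuse per-layer token embeddings with per-time-step attention.

    emb: array of shape (L, T, D), one embedding per layer and time step.
    w:   array of shape (D,), a hypothetical scoring vector (assumption).
    Returns the fused sequence of shape (T, D) and the weights of shape (L, T).
    """
    z = emb @ w                                # scores z[l, t], shape (L, T)
    z = z - z.max(axis=0, keepdims=True)       # stabilize the softmax
    a = np.exp(z)
    a = a / a.sum(axis=0, keepdims=True)       # softmax over layers: a[l, t]
    fused = (a[..., None] * emb).sum(axis=0)   # weighted sum over layers
    return fused, a

rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 5, 4))               # L=3 layers, T=5 steps, D=4 dims
w = rng.normal(size=4)
fused, a = fuse_layers(emb, w)
```

Under these assumptions, the weights `a[:, t]` sum to one at every time step, so averaging them over time gives a rough per-layer importance of the kind the summary describes as interpretable.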
Abstract
Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and many other audio details. Current audio tokenization methods fall into two categories: semantic tokens, acquired through quantization of Self-Supervised Learning (SSL) models, and neural compression-based tokens (codecs). Although previous studies have benchmarked codec models to identify optimal configurations, the ideal setup for quantizing pretrained SSL models remains unclear. This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks. We propose a scalable solution to train a universal vocoder across multiple SSL layers. Furthermore, an attention mechanism is employed to identify task-specific influential layers, enhancing the adaptability and performance of semantic tokens in diverse audio applications.
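As a rough illustration of what quantizing several SSL layers into semantic tokens involves, the sketch below assigns each frame of each layer to its nearest centroid. It assumes that one codebook per layer has already been fit (for example with $K$-means) and that frame-level features are available as arrays; the function name `tokenize` and the toy values are hypothetical and do not come from the paper.

```python
import numpy as np

def tokenize(features, codebooks):
    """Map continuous frames to discrete token ids, layer by layer.

    features:  array of shape (L, T, D), SSL features for L layers.
    codebooks: array of shape (L, K, D), K pre-fit centroids per layer.
    Returns integer token ids of shape (L, T).
    """
    # Squared Euclidean distance from every frame to every centroid: (L, T, K)
    diff = features[:, :, None, :] - codebooks[:, None, :, :]
    dist = (diff ** 2).sum(axis=-1)
    return dist.argmin(axis=-1)                # index of the nearest centroid

# Toy example: L=2 layers, K=2 centroids, D=2 dimensions, T=2 frames.
codebooks = np.array([[[0.0, 0.0], [1.0, 1.0]],
                      [[0.0, 1.0], [1.0, 0.0]]])
features = np.array([[[0.1, 0.0], [0.9, 1.2]],
                     [[0.8, 0.1], [0.2, 0.9]]])
tokens = tokenize(features, codebooks)         # [[0, 1], [1, 0]]
```

Each layer keeps its own codebook here, so the same frame can map to different token ids in different layers; a downstream model would then embed these ids per layer before combining them.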
