How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

Pooneh Mousavi; Jarod Duret; Salah Zaiem; Luca Della Libera; Artem Ploujnikov; Cem Subakan; Mirco Ravanelli

TL;DR

This work studies how to extract effective discrete audio tokens from self-supervised models and introduces a four-part architecture: a Tokenizer that quantizes multiple SSL layers via $K$-means; an Informed Layer Selector that uses per-time-step attention with $z_{l,t}$ and $a_{l,t}$ to fuse layer information; an Acoustic Model that uses the fused representations for discriminative and generative tasks; and a Scalable Vocoder trained with layer dropout to decode arbitrary layer combinations. The scalable vocoder and the attention-based layer fusion enable robust multi-layer token usage: the scalable vocoder outperforms vocoders trained on single layers, and the attention-based fusion provides interpretable layer importance across tasks. Results on ASR, SID, ER, SE, and TTS show that layer usage is task-dependent, with lower layers often critical for reconstruction and higher layers for semantic content. The number of clusters, the embedding initialization, and the domain of the training data each affect performance differently by task. The findings support discrete semantic tokens as a viable bridge between audio and language processing, with practical gains for multi-task audio modeling and potential integration into multimodal LLMs. Future work will cover more tasks and multi-speaker vocoding.
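To make the per-time-step layer fusion more concrete, here is a minimal sketch in plain Python. It assumes that $z_{l,t}$ is the embedding of the token from layer $l$ at time step $t$ and that $a_{l,t}$ is a softmax attention weight over layers at that time step; the summary above does not define either symbol. The scoring vector `w` and the function names are hypothetical, chosen for illustration only, and the paper's actual scoring function may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(z, w):
    """Fuse per-layer token embeddings at every time step.

    z[l][t] is the embedding (list of floats) for layer l at time t.
    w is a hypothetical scoring vector (an assumption of this sketch).
    Returns (fused, attn), where attn[l][t] plays the role of a_{l,t}.
    """
    num_layers, num_steps = len(z), len(z[0])
    attn = [[0.0] * num_steps for _ in range(num_layers)]
    fused = []
    for t in range(num_steps):
        # One scalar score per layer at this time step (dot product with w).
        scores = [
            sum(wi * zi for wi, zi in zip(w, z[l][t]))
            for l in range(num_layers)
        ]
        a_t = softmax(scores)  # weights over layers sum to 1
        for l in range(num_layers):
            attn[l][t] = a_t[l]
        dim = len(z[0][t])
        # Attention-weighted sum of the layer embeddings.
        fused.append(
            [sum(a_t[l] * z[l][t][d] for l in range(num_layers)) for d in range(dim)]
        )
    return fused, attn
```

Averaging `attn[l][t]` over time for each layer would give one importance score per layer, which is the kind of quantity a layer analysis across tasks could inspect.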

Abstract

Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and many other audio details. Current audio tokenization methods fall into two categories: semantic tokens, acquired through quantization of self-supervised learning (SSL) models, and neural compression-based tokens (codecs). Although previous studies have benchmarked codec models to identify optimal configurations, the ideal setup for quantizing pretrained SSL models remains unclear. This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks. We propose a scalable solution to train a universal vocoder across multiple SSL layers. Furthermore, an attention mechanism is employed to identify task-specific influential layers, enhancing the adaptability and performance of semantic tokens in diverse audio applications.
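As a rough illustration of how semantic tokens are obtained by quantizing SSL features, the sketch below assigns each frame of each layer to its nearest centroid. It assumes one k-means codebook per layer has already been fitted; the function names and the toy data are hypothetical, and a real pipeline would take features from an SSL model and centroids from a k-means library.

```python
def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean)."""
    best_index, best_dist = 0, float("inf")
    for k, centroid in enumerate(centroids):
        dist = sum((x - y) ** 2 for x, y in zip(frame, centroid))
        if dist < best_dist:
            best_index, best_dist = k, dist
    return best_index


def tokenize(layer_features, layer_centroids):
    """Map continuous features of several layers to discrete token ids.

    layer_features[l][t]: feature vector of layer l at frame t.
    layer_centroids[l][k]: centroid k of the codebook assumed for layer l.
    Returns tokens[l][t], one token id per layer and frame.
    """
    return [
        [nearest_centroid(frame, centroids) for frame in frames]
        for frames, centroids in zip(layer_features, layer_centroids)
    ]


# Toy usage: two layers, two frames each, two centroids per layer.
features = [
    [[0.1, 0.0], [0.9, 1.0]],
    [[5.0, 5.0], [0.0, 0.2]],
]
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [5.0, 5.0]],
]
tokens = tokenize(features, codebooks)
```

Each layer keeps its own codebook here, so the same audio yields one token sequence per layer rather than a single shared sequence.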

Paper Structure (16 sections, 3 equations, 3 figures, 2 tables)

Introduction
Model Design
Tokenizer
Informed Layer Selector
Acoustic Model
Scalable Vocoder
Experiments
Discriminative Tasks
Generative Tasks
Results
Scalable Vocoder
Layer Analysis
Effect of Number of Clusters
Effect of Embedding Initialization
Out-of-Distribution Generalization
...and 1 more section

Figures (3)

Figure 1: The proposed method for audio token extraction from SSL models: (A) k-means discretizes the continuous representations of each layer, (B) an attention mechanism merges the discrete layer representations, (C) the mixed representations train acoustic models for discriminative and generative tasks, (D) our scalable vocoder generates waveforms (if needed).
Figure 2: Performance of the Scalable Vocoder (SV) at different layers compared to a Single-Layer Vocoder (SLV). Vocoders and tokenizers are trained using the LJSpeech dataset with 1000 and 2000 centroids.
Figure 3: Attention analysis across various tasks and layers of the discrete WavLM model with in-domain tokenizers.
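The Scalable Vocoder compared in Figure 2 is described as being trained with layer dropout so that it can decode arbitrary layer combinations. A minimal sketch of that idea follows, under stated assumptions: the drop probability, the rule that at least one layer is always kept, and the use of `None` to mark a dropped layer are illustrative choices, not details confirmed by the text.

```python
import random


def layer_dropout(layer_tokens, drop_prob=0.5, rng=random):
    """Randomly mask whole layers of tokens for one training example.

    layer_tokens[l] is the token sequence of layer l.
    Dropped layers are replaced by None; at least one layer is always kept
    (an assumption of this sketch).
    """
    num_layers = len(layer_tokens)
    keep = [rng.random() >= drop_prob for _ in range(num_layers)]
    if not any(keep):
        # Avoid an empty input: restore one layer chosen at random.
        keep[rng.randrange(num_layers)] = True
    return [tokens if kept else None for tokens, kept in zip(layer_tokens, keep)]


# Toy usage with a seeded generator for reproducibility.
example = [[1, 2], [3, 4], [5, 6]]
masked = layer_dropout(example, drop_prob=0.5, rng=random.Random(0))
```

A vocoder that sees a different random subset of layers at each step has to cope with missing layers, which is one plausible way to support any layer combination at inference time.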
