ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling

Dongchao Yang, Songxiang Liu, Haohan Guo, Jiankun Zhao, Yuanyuan Wang, Helin Wang, Zeqian Ju, Xubo Liu, Xueyuan Chen, Xu Tan, Xixin Wu, Helen Meng

TL;DR

The paper addresses the challenge of tokenizing audio for language modeling with a focus on low bitrate and semantic richness. It introduces ALMTokenizer, a query-based compression framework that interleaves learnable query tokens with audio frames, uses semantic priors in a 3-layer RVQ, and employs MAE and AR losses within a two-stage training regime to enrich semantic content while maintaining compact token representations. Empirical results across speech, sound, and music demonstrate competitive reconstruction at low bitrate and substantially improved LM-based understanding and generation, including TTS, ASR, and captioning tasks, compared to prior tokenizers. The work advances practical audio language modeling by balancing cross-frame context, semantic information, and computational efficiency, while also outlining limitations and avenues for future refinement and responsible use.
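To make "low bitrate" concrete, the bitrate of a residual vector quantizer (RVQ) can be estimated as tokens per second, times the number of RVQ layers, times the bits per codebook index. The sketch below shows only that arithmetic. The 3-layer depth comes from the summary above; the token rate and codebook size are hypothetical values chosen for illustration and are not taken from the paper.

```python
import math


def rvq_bitrate_bps(tokens_per_second: float, num_layers: int, codebook_size: int) -> float:
    """Estimate the bitrate (bits per second) of an RVQ token stream.

    Each time step emits one index per RVQ layer, and each index
    costs log2(codebook_size) bits.
    """
    if tokens_per_second <= 0 or num_layers <= 0 or codebook_size <= 1:
        raise ValueError("all arguments must be positive and codebook_size > 1")
    return tokens_per_second * num_layers * math.log2(codebook_size)


# Hypothetical settings (assumptions, not the paper's reported configuration):
# 12.5 token steps per second and a codebook of 2048 entries per layer.
example_bps = rvq_bitrate_bps(tokens_per_second=12.5, num_layers=3, codebook_size=2048)
print(f"{example_bps / 1000:.4f} kbps")
```

Under these assumed values, lowering the token rate reduces the bitrate linearly, which is why compressing several frames into fewer tokens matters for language modeling.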

Abstract

Recent advancements in audio language models have underscored the pivotal role of audio tokenization, which converts audio signals into discrete tokens, thereby facilitating the application of language model architectures to the audio domain. In this study, we introduce ALMTokenizer, a novel low-bitrate and semantically rich audio codec tokenizer for audio language models. Prior methods, such as Encodec, typically encode individual audio frames into discrete tokens without considering the use of context information across frames. Unlike these methods, we introduce a novel query-based compression strategy to capture holistic information with a set of learnable query tokens by explicitly modeling the context information across frames. This design not only enables the codec model to capture more semantic information but also encodes the audio signal with fewer token sequences. Additionally, to enhance the semantic information in audio codec models, we introduce the following: (1) A masked autoencoder (MAE) loss, (2) Vector quantization based on semantic priors, and (3) An autoregressive (AR) prediction loss. As a result, ALMTokenizer achieves competitive reconstruction performance relative to state-of-the-art approaches while operating at a lower bitrate. Within the same audio language model framework, ALMTokenizer outperforms previous tokenizers in audio understanding and generation tasks.
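The query-based compression idea can be illustrated with a minimal sketch of the interleaving step alone. It assumes one placeholder query token is placed after every window of `w` frames and that only the query positions would later be kept for quantization; the placement, the `QUERY` placeholder, and the function name `interleave_queries` are illustrative assumptions, not the authors' implementation. The transformer that would let each query attend to surrounding frames is omitted.

```python
from typing import List, Tuple

QUERY = "Q"  # stand-in for a learnable query embedding (hypothetical placeholder)


def interleave_queries(frames: List[str], w: int) -> Tuple[List[str], List[int]]:
    """Insert one query token after every window of w frames.

    Returns the interleaved sequence and the indices of the query tokens.
    A model following this layout would keep only the outputs at those
    indices, so roughly len(frames) / w tokens remain.
    """
    if w <= 0:
        raise ValueError("window size w must be positive")
    sequence: List[str] = []
    query_positions: List[int] = []
    for start in range(0, len(frames), w):
        sequence.extend(frames[start:start + w])  # the frames of this window
        query_positions.append(len(sequence))     # index where the query lands
        sequence.append(QUERY)
    return sequence, query_positions


frames = [f"f{i}" for i in range(6)]
sequence, query_positions = interleave_queries(frames, w=3)
print(sequence)         # ['f0', 'f1', 'f2', 'Q', 'f3', 'f4', 'f5', 'Q']
print(query_positions)  # [3, 7]
```

In this toy layout, six frames with `w = 3` yield two query tokens, so the token sequence passed on for quantization is three times shorter than the frame sequence.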

Paper Structure

This paper contains 51 sections, 5 equations, 5 figures, 17 tables.

Figures (5)

  • Figure 1: Performance comparison when different types of tokenizers are used for audio modeling. PPL refers to perplexity.
  • Figure 2: The left part illustrates the framework of previous audio codecs, while the right part provides an overview of the proposed ALMTokenizer. $w$ denotes the window size. The details of ALMTokenizer can be found in the paper's section on query-based compression.
  • Figure 3: Performance comparison with and without the AR loss.
  • Figure 4: The left diagram illustrates the framework of the audio language model, which includes a pre-trained LLM, a LoRA module, and a depth transformer. The audio language model can process both text and audio streaming inputs and generate corresponding text and audio outputs. The right diagram provides details of hierarchical audio modeling.
  • Figure 5: Performance comparison with different window sizes during inference.