Make Some Noise: Towards LLM audio reasoning and generation using sound tokens

Shivam Mehta, Nebojsa Jojic, Hannes Gamper

TL;DR

This work tackles the problem of unifying audio understanding and generation with text-based LLMs by introducing an ultra-low bitrate audio tokenizer built from Variational Quantization and Conditional Flow Matching. Audio is converted into discrete tokens at 0.23 kbps, and these tokens are integrated via early fusion into a Vicuna-based LLM fine-tuned with LoRA, so that multimodal reasoning happens within a single model. Empirical results show the VQ+FM tokenizer outperforms deterministic VQ-VAE-based baselines on reconstruction and semantic metrics. The end-to-end LM-MSN system achieves audio comprehension competitive with state-of-the-art methods but exhibits limited audio generation quality due to tokenization tradeoffs. The findings stress the need for larger, more diverse datasets and improved evaluation to advance truly multimodal LLMs that can both understand and generate audio alongside text.
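As a rough illustration of what early fusion of audio and text tokens can look like, the sketch below maps audio codes to ids placed after the text vocabulary, so that a single sequence model sees one shared id space. It is not the authors' implementation: the vocabulary size, codebook size, marker tokens, and the helper names `fuse` and `split` are all hypothetical choices made for this example.

```python
from typing import List, Tuple

# All constants are illustrative assumptions, not values taken from the paper.
TEXT_VOCAB_SIZE = 32_000            # hypothetical size of the text vocabulary
AUDIO_CODEBOOK_SIZE = 1_024         # hypothetical number of audio codes
BOA_ID = TEXT_VOCAB_SIZE            # hypothetical "audio starts" marker id
EOA_ID = TEXT_VOCAB_SIZE + 1        # hypothetical "audio ends" marker id
AUDIO_OFFSET = TEXT_VOCAB_SIZE + 2  # first id reserved for audio codes
EXTENDED_VOCAB_SIZE = AUDIO_OFFSET + AUDIO_CODEBOOK_SIZE


def fuse(text_ids: List[int], audio_codes: List[int]) -> List[int]:
    """Append an audio segment to a text prompt in one shared id space."""
    for code in audio_codes:
        if not 0 <= code < AUDIO_CODEBOOK_SIZE:
            raise ValueError(f"audio code out of range: {code}")
    audio_ids = [AUDIO_OFFSET + code for code in audio_codes]
    return list(text_ids) + [BOA_ID] + audio_ids + [EOA_ID]


def split(ids: List[int]) -> Tuple[List[int], List[int]]:
    """Recover text ids and raw audio codes; marker ids are dropped."""
    text_ids: List[int] = []
    audio_codes: List[int] = []
    for token_id in ids:
        if token_id >= AUDIO_OFFSET:
            audio_codes.append(token_id - AUDIO_OFFSET)
        elif token_id < TEXT_VOCAB_SIZE:
            text_ids.append(token_id)
    return text_ids, audio_codes


sequence = fuse([1, 5], [0, 3])
print(sequence)         # fused ids fed to the language model
print(split(sequence))  # text ids and audio codes recovered from the sequence
```

Under this scheme the model's embedding table and output layer would have to cover `EXTENDED_VOCAB_SIZE` ids, which is one reason fine-tuning is needed before the new tokens carry any meaning.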

Abstract

Integrating audio comprehension and generation into large language models (LLMs) remains challenging due to the continuous nature of audio and the resulting high sampling rates. Here, we introduce a novel approach that combines Variational Quantization with Conditional Flow Matching to convert audio into ultra-low bitrate discrete tokens of 0.23 kbps, allowing for seamless integration with text tokens in LLMs. We fine-tuned a pretrained text-based LLM using Low-Rank Adaptation (LoRA) to assess its effectiveness in achieving true multimodal capabilities, i.e., audio comprehension and generation. Our tokenizer outperforms a traditional VQ-VAE across various datasets with diverse acoustic events. Despite the substantial loss of fine-grained details through audio tokenization, our multimodal LLM trained with discrete tokens achieves results in audio comprehension that are competitive with state-of-the-art methods, though audio generation is poor. Our results highlight the need for larger, more diverse datasets and improved evaluation metrics to advance multimodal LLM performance.
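To give a sense of scale for the 0.23 kbps figure, the bitrate of a single-codebook tokenizer is the token rate times the bits per token, i.e. tokens per second times log2 of the codebook size. The sketch below applies that arithmetic. The codebook size of 1024 and the 16 kHz, 16-bit mono PCM reference are assumptions chosen for illustration; the abstract states only the 0.23 kbps bitrate.

```python
import math


def bitrate_kbps(tokens_per_second: float, codebook_size: int) -> float:
    """Bitrate of a single-codebook tokenizer, in kilobits per second."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * bits_per_token / 1000.0


def tokens_per_second_for(target_kbps: float, codebook_size: int) -> float:
    """Token rate implied by a target bitrate and a given codebook size."""
    return target_kbps * 1000.0 / math.log2(codebook_size)


# Assumption: one codebook with 1024 entries (10 bits per token).
ASSUMED_CODEBOOK_SIZE = 1024
rate = tokens_per_second_for(0.23, ASSUMED_CODEBOOK_SIZE)
print(f"token rate at 0.23 kbps: {rate:.1f} tokens/s")

# Assumption: reference signal is 16 kHz, 16-bit mono PCM (256 kbps).
pcm_kbps = 16_000 * 16 / 1000.0
print(f"compression vs. PCM: {pcm_kbps / 0.23:.0f}x")
```

Under these assumed values, ten seconds of audio would occupy on the order of a few hundred tokens, which is what makes placing audio in an LLM context window alongside text practical.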

Paper Structure

This paper contains 12 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Architecture of the audio tokenizer, containing a frozen autoencoder followed by a causal encoder and a conditional flow matching-based decoder with a Diffusion Transformer that reconstructs representations from quantised vectors.
  • Figure 2: Overall pipeline for multimodal LLM
  • Figure 3: Audio quantization performance for held-out datasets in terms of FAD and estimated MOS.