Make Some Noise: Towards LLM audio reasoning and generation using sound tokens
Shivam Mehta, Nebojsa Jojic, Hannes Gamper
TL;DR
This work tackles the problem of unifying audio understanding and generation with text-based LLMs by introducing an ultra-low bitrate audio tokenizer built from Variational Quantization and Conditional Flow Matching. By converting audio into discrete tokens at $0.23$ kbps and integrating them via early fusion into a Vicuna-based LLM using LoRA, the approach enables multimodal reasoning within a single model. Empirical results show that the VQ+FM tokenizer outperforms deterministic VQ-VAE-based baselines on reconstruction and semantic metrics. The end-to-end LM-MSN system achieves audio comprehension competitive with state-of-the-art methods, but its audio generation quality is limited by tokenization tradeoffs. The findings stress the need for larger, more diverse datasets and improved evaluation to advance truly multimodal LLMs that can both understand and generate audio alongside text.
Abstract
Integrating audio comprehension and generation into large language models (LLMs) remains challenging due to the continuous nature of audio and the resulting high sampling rates. Here, we introduce a novel approach that combines Variational Quantization with Conditional Flow Matching to convert audio into ultra-low bitrate discrete tokens at 0.23 kbps, allowing for seamless integration with text tokens in LLMs. We fine-tuned a pretrained text-based LLM using Low-Rank Adaptation (LoRA) to assess its effectiveness in achieving true multimodal capabilities, i.e., audio comprehension and generation. Our tokenizer outperforms a traditional VQ-VAE across various datasets with diverse acoustic events. Despite the substantial loss of fine-grained details through audio tokenization, our multimodal LLM trained with discrete tokens achieves results in audio comprehension that are competitive with state-of-the-art methods, though audio generation is poor. Our results highlight the need for larger, more diverse datasets and improved evaluation metrics to advance multimodal LLM performance.
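
To make the bitrate figure and the idea of mixing audio tokens with text tokens concrete, the following minimal Python sketch shows how a token bitrate follows from a token rate and a codebook size, and how audio token ids could be shifted into an extended vocabulary. All numeric values (23 tokens per second, a 1024-entry codebook, a 32000-entry text vocabulary) and both function names are illustrative assumptions chosen so that the arithmetic lands on 0.23 kbps; they are not settings reported in this abstract.

```python
import math


def token_bitrate_kbps(tokens_per_second: float, codebook_size: int) -> float:
    """Bitrate of a single-codebook token stream, in kilobits per second."""
    bits_per_token = math.log2(codebook_size)
    return tokens_per_second * bits_per_token / 1000.0


def to_shared_vocab(audio_token_ids: list[int], text_vocab_size: int) -> list[int]:
    """Shift audio token ids past the text vocabulary so both share one id space.

    This is one common way to realise early fusion; it is an assumption here,
    not necessarily the scheme used by the authors.
    """
    return [text_vocab_size + token_id for token_id in audio_token_ids]


# Hypothetical values, picked only to illustrate the arithmetic.
ASSUMED_TOKENS_PER_SECOND = 23
ASSUMED_CODEBOOK_SIZE = 1024   # 10 bits per token
ASSUMED_TEXT_VOCAB_SIZE = 32000

rate = token_bitrate_kbps(ASSUMED_TOKENS_PER_SECOND, ASSUMED_CODEBOOK_SIZE)
fused = to_shared_vocab([0, 5, 1023], ASSUMED_TEXT_VOCAB_SIZE)
print(f"{rate:.2f} kbps", fused)
```

Under these assumed values, one second of audio occupies only 23 positions in the LLM context, which is what makes interleaving audio with text practical at this bitrate.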
