Towards Audio Token Compression in Large Audio Language Models
Saurabhchand Bhati, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass
TL;DR
The paper tackles the scalability bottlenecks of large audio language models by compressing audio tokens before they reach the LLM backbone, using unsupervised segmentation, uniform averaging, and uniform downsampling, with LoRA-based realignment to mitigate representation misalignment. Through experiments on ASR and speech-to-speech translation across LibriSpeech, CommonVoice, and CoVoST, the authors demonstrate that compressed LALMs can achieve performance close to frame-level models while reducing the number of input audio tokens by up to a factor of three. The results show more favorable trade-offs for pooling-based compression than for sampling, and indicate that cross-language and cross-task transfer benefits can be gained with careful data selection and task-specific fine-tuning. The work enables more scalable, long-form audio processing on resource-constrained devices, with future directions including non-LoRA rebalancing methods and additional audio understanding benchmarks.
Abstract
Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. These challenges make it difficult to extend LALMs to long-form audio and to deploy them on resource-constrained platforms such as edge devices. In this paper, we explore techniques such as unsupervised segmentation and uniform average pooling to reduce the number of audio tokens after they are generated by the LALM's audio encoder but before they are consumed by the LLM decoder. To mitigate the performance degradation that compressed representations may introduce, we finetune the model with low-rank adapters. We evaluate the proposed models on two tasks, automatic speech recognition and speech-to-speech translation, both of which depend on recovering the underlying lexical content of the input signal, and study the effect of downsampling on these tasks. Experimental results show that compressed LALMs can achieve performance close to that of frame-level LALMs while reducing the input audio token count by up to three times before the LLM backbone.
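To make the two uniform compression schemes concrete, the following is a minimal sketch, not the authors' implementation. It assumes the audio encoder output is a NumPy array of shape (frames, dim); the function names `uniform_average_pool` and `uniform_downsample`, the handling of a leftover tail group, and the toy input are all illustrative assumptions. Unsupervised segmentation and the low-rank adapter finetuning are not shown.

```python
import numpy as np


def uniform_average_pool(frames: np.ndarray, factor: int) -> np.ndarray:
    """Average each run of `factor` consecutive frames into one token.

    frames: (num_frames, dim) encoder outputs (assumed layout).
    A shorter leftover group at the end is averaged on its own
    (an assumption; other tail policies are possible).
    """
    num_frames, dim = frames.shape
    full = (num_frames // factor) * factor
    # Group the frames that fill complete windows, then average each window.
    pooled = frames[:full].reshape(full // factor, factor, dim).mean(axis=1)
    if full < num_frames:
        tail = frames[full:].mean(axis=0, keepdims=True)
        pooled = np.concatenate([pooled, tail], axis=0)
    return pooled


def uniform_downsample(frames: np.ndarray, factor: int) -> np.ndarray:
    """Keep every `factor`-th frame and discard the rest."""
    return frames[::factor]


# Toy example: 6 frames of dimension 2, compressed by a factor of 3.
frames = np.arange(12, dtype=np.float64).reshape(6, 2)
pooled = uniform_average_pool(frames, factor=3)
sampled = uniform_downsample(frames, factor=3)
print(pooled)   # rows are the means of frames 0-2 and frames 3-5
print(sampled)  # rows are frame 0 and frame 3
```

Both functions cut the token count by the same factor; the difference is that pooling keeps a contribution from every frame, whereas downsampling drops the intermediate frames entirely. Because self-attention cost grows quadratically with sequence length, a threefold reduction in audio tokens would shrink the audio-to-audio portion of the attention computation by roughly a factor of nine, although any text tokens in the prompt are unaffected.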
