Scaling Auditory Cognition via Test-Time Compute in Audio Language Models

Ting Dang, Yan Gao, Hong Jia

TL;DR

This work assesses how audio large language models handle real-world auditory cognition and demonstrates that performance degrades with increasing task difficulty. It introduces five test-time compute strategies, combining chain-of-thought prompting and verifier-based decoding, to boost inference without retraining. Across five Audio LLMs, results show substantial TTC-driven gains, with GPT-4o achieving near-human or superhuman performance in complex scenes, though gains are model- and task-dependent. The findings highlight TTC as a practical route to more robust auditory processing in assistive listening, voice assistants, and communication tech, while pointing to future needs in dataset diversity and audio-specific reward modeling.

Abstract

Large language models (LLMs) have shown exceptional versatility in natural language processing, prompting recent efforts to extend their multimodal capabilities to speech processing through the development of audio large language models (Audio LLMs). While Audio LLMs excel in tasks such as speech recognition and synthesis, it remains unclear how they perform when faced with the auditory cognitive challenges posed by real-world environments, such as audio comprehension and listening recall, particularly in the presence of background noise or overlapping speech. Unlike text-based LLMs, which have access to vast amounts of text data for pre-training, retraining Audio LLMs with diverse auditory cognitive scenes is difficult due to the limited datasets that simulate real-world auditory cognitive scenarios and the challenge of acquiring auditory cognitive labels for training. While test-time compute (TTC) methods have been shown to enhance the capabilities of text-based LLMs during inference, a key challenge lies in designing these TTC methods to improve the auditory capabilities of Audio LLMs. This study aims to address these two research gaps by: i) exploring the auditory cognitive capabilities of Audio LLMs, and ii) enhancing their capabilities using TTC approaches. We have investigated five different Audio LLMs for auditory cognition using a self-collected database and have proposed five TTC approaches to enhance auditory cognitive capabilities during inference. Our findings reveal that the performance of Audio LLMs decreases on more challenging auditory cognitive tasks. The proposed TTC approaches significantly enhance auditory cognitive capabilities, advancing the development of more adaptable and resilient Audio LLMs for practical applications such as assistive listening devices, voice-based AI assistants, and communication technologies.

Paper Structure

This paper contains 35 sections, 3 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Data collection and experimental design. Three tasks with increasing difficulty in auditory processing were designed, and both humans and LLMs were prompted to answer the same questions after the audio played.
  • Figure 2: Search against verifier approaches: i) Self-consistency decoding, which applies a majority vote over the $N$ sampled outputs; ii) Best-of-N sampling with beam search, where the selected beams (outputs) are weighted by their log-likelihood; and iii) LLM verifier, which employs a stronger LLM as the reward model to score the multiple outputs, ranking or weighting them to optimize the final output.
  • Figure 3: Performance of Audio LLMs (without TTC) compared with human perception.
  • Figure 4: Performance trend from simple to complex acoustic scenes.
  • Figure 5: Evaluating the effect of CoT on model performance. The improvement with CoT varies depending on the specific models and tasks.
  • ...and 2 more figures
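
The three search-against-verifier strategies named in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the softmax used to turn log-likelihoods into weights is an assumption (the caption says only that beams are weighted by log-likelihood), and the verifier scores are assumed to be numbers already returned by some stronger model.

```python
from collections import Counter, defaultdict
import math


def self_consistency(answers):
    """Majority vote over N sampled answers (ties go to the earliest seen)."""
    return Counter(answers).most_common(1)[0][0]


def loglik_weights(log_likelihoods):
    """Assumed weighting: softmax over the beam log-likelihoods."""
    peak = max(log_likelihoods)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in log_likelihoods]
    total = sum(exps)
    return [value / total for value in exps]


def weighted_vote(answers, weights):
    """Sum the weight of every output per distinct answer; return the heaviest."""
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    return max(totals, key=totals.get)


# Four hypothetical outputs for one multiple-choice question.
answers = ["B", "A", "B", "C"]
beam_logliks = [-5.0, -0.1, -4.0, -6.0]   # illustrative values only
verifier_scores = [0.2, 0.9, 0.3, 0.1]    # illustrative values only

majority = self_consistency(answers)
by_loglik = weighted_vote(answers, loglik_weights(beam_logliks))
by_verifier = weighted_vote(answers, verifier_scores)
```

In this toy input the majority vote picks "B", while both weighted variants pick "A", because a single high-confidence output outweighs two low-confidence ones. This shows how the three strategies can disagree on the same set of outputs.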