How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

Ke-Han Lu; Szu-Wei Fu; Chao-Han Huck Yang; Zhehuai Chen; Sung-Feng Huang; Chih-Kai Yang; Yi-Cheng Lin; Chi-Yuan Hsiao; Wenze Ren; En-Pei Hu; Yu-Han Huang; An-Yu Cheng; Cheng-Han Chiang; Yu Tsao; Yu-Chiang Frank Wang; Hung-yi Lee

Abstract

Large language models (LLMs) are widely used as the knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text-only pre-training, and how that knowledge affects downstream performance, remain unclear. We study this gap by comparing different LLMs under two text-only settings and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions from an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into a LALM with an audio encoder. Our findings reveal that auditory knowledge varies substantially across model families, and that text-only results are strongly correlated with audio performance. Our work provides empirical grounding for a comprehensive understanding of LLMs in audio research.
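As a rough illustration of the cascade setting described in the abstract, the sketch below scores a text-only LLM on questions about audio clips after a captioner has turned each clip into a text description. Everything in it is an assumption made for illustration: the multiple-choice format, the prompt wording, the example fields, and the helper names `build_prompt`, `cascade_accuracy`, `toy_captioner`, and `toy_llm` are hypothetical and do not come from the paper, and the toy data are placeholders rather than benchmark items.

```python
def build_prompt(caption, question, choices):
    """Format a text-only prompt from a caption and a multiple-choice question.

    The prompt wording is a hypothetical choice, not the paper's template.
    """
    options = "\n".join(f"{label}. {text}" for label, text in choices.items())
    return (
        f"Audio description: {caption}\n"
        f"Question: {question}\n"
        f"{options}\n"
        "Answer with the letter of the correct option."
    )


def cascade_accuracy(examples, captioner, llm):
    """Return accuracy (%) of a text-only LLM that only sees captions.

    captioner: callable mapping an audio reference to a text description.
    llm: callable mapping a prompt string to an answer string.
    """
    if not examples:
        raise ValueError("examples must not be empty")
    correct = 0
    for ex in examples:
        caption = captioner(ex["audio"])  # audio -> text description
        prompt = build_prompt(caption, ex["question"], ex["choices"])
        prediction = llm(prompt).strip()  # the LLM never receives the audio
        correct += prediction == ex["answer"]
    return 100.0 * correct / len(examples)


# Stand-ins for a real captioner and LLM (placeholders, not real models).
def toy_captioner(audio_path):
    captions = {"clip_a.wav": "a dog barking", "clip_b.wav": "a piano melody"}
    return captions[audio_path]


def toy_llm(prompt):
    return "A" if "dog" in prompt else "B"


toy_examples = [
    {"audio": "clip_a.wav", "question": "Which animal is heard?",
     "choices": {"A": "Dog", "B": "Cat"}, "answer": "A"},
    {"audio": "clip_b.wav", "question": "Which instrument is played?",
     "choices": {"A": "Violin", "B": "Piano"}, "answer": "B"},
    {"audio": "clip_b.wav", "question": "What is the tempo?",
     "choices": {"A": "Slow", "B": "Fast"}, "answer": "A"},
]

toy_score = cascade_accuracy(toy_examples, toy_captioner, toy_llm)
```

In a real run, `captioner` and `llm` would wrap actual models; keeping them as plain callables makes it possible to swap LLM backbones while holding the captioner fixed.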

Paper Structure (20 sections, 3 figures, 4 tables)

Introduction
Related Work
  Audio Understanding Systems
  Evaluating Auditory Knowledge and Capabilities
Method
  Text-only Auditory Knowledge Benchmark Evaluation
  Text-only Cascade Evaluation
  Audio-Grounded Evaluation via End-to-End Fine-Tuning
Experimental Setup
  Evaluated LLMs
  Fine-Tuning Configuration
  Evaluation and Inference Setup
Results
  Overall Trend
  Results on Auditory Knowledge Benchmark Evaluation
...and 5 more sections

Figures (3)

Figure 1: Overview of the three evaluations introduced in this work. (Top) AKB-2000 construction pipeline: a two-level taxonomy guides LLM-assisted question generation, followed by human verification. (Middle) Cascade evaluation: a captioner converts audio to text descriptions fed to a text-only LLM. (Bottom) Audio-grounded evaluation: each LLM is fine-tuned into a LALM using the DeSTA self-distillation framework and evaluated with audio input.
Figure 2: Pearson correlation heatmap across all five evaluation metrics. The white line separates text-only metrics (top-left) from audio-grounded metrics (bottom-right).
Figure 3: Category-level scatter plots comparing cascade and audio-grounded accuracy (%) for 8 fine-tuned LALMs, broken down by Sound, Music, and Speech domains.
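
The kind of pairwise Pearson correlation matrix that Figure 2 visualizes can be sketched as follows. This is a minimal illustration only: the metric names and all scores in `toy_scores` are made-up placeholders, not the paper's five metrics or its results, and the helper names `pearson` and `correlation_matrix` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(scores):
    """Pairwise Pearson correlations between metrics.

    scores: dict mapping a metric name to a list of per-model scores,
    with every list using the same model order.
    """
    names = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}


# Placeholder scores for four hypothetical models; NOT results from the paper.
toy_scores = {
    "text_only_metric": [40.0, 50.0, 60.0, 70.0],
    "audio_grounded_metric": [35.0, 45.0, 50.0, 65.0],
}

matrix = correlation_matrix(toy_scores)
```

Each entry `matrix[a][b]` is one cell of such a heatmap: every metric correlates perfectly with itself, and the matrix is symmetric.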
