AuditoryBench++: Can Language Models Understand Auditory Knowledge without Hearing?

Hyunjong Ok, Suho Yoo, Hyeonjun Kim, Jaeho Lee

TL;DR

AuditoryBench++ is a text-only benchmark for evaluating auditory knowledge in large language models (LLMs), which often lack the auditory commonsense that humans apply without hearing any sound. The authors also propose AIR-CoT, a two-stage reasoning framework: the model first detects spans that require auditory knowledge, then injects auditory embeddings produced by a CLAP-based imagination module, which enables end-to-end auditory reasoning. Empirically, AIR-CoT outperforms several baselines on pitch, animal sound recognition, and auditory context reasoning; duration and loudness remain challenging due to limited temporal cues in current embeddings. Overall, the work lays a foundation for language models that can imagine auditory information and reason about sound without direct audio input.

Abstract

Even without directly hearing sounds, humans can effortlessly reason about auditory properties, such as pitch, loudness, or sound-source associations, drawing on auditory commonsense. In contrast, language models often lack this capability, limiting their effectiveness in multimodal interactions. As an initial step to address this gap, we present AuditoryBench++, a comprehensive benchmark for evaluating auditory knowledge and reasoning in text-only settings. The benchmark encompasses tasks that range from basic auditory comparisons to contextually grounded reasoning, enabling fine-grained analysis of how models process and integrate auditory concepts. In addition, we introduce AIR-CoT, a novel auditory imagination reasoning method that generates and integrates auditory information during inference through span detection with special tokens and knowledge injection. Extensive experiments with recent LLMs and Multimodal LLMs demonstrate that AIR-CoT generally outperforms both the off-the-shelf models and those augmented with auditory knowledge. The project page is available at https://auditorybenchpp.github.io.

Paper Structure

This paper contains 6 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Overview of AuditoryBench++, which assesses auditory knowledge of language models without audio input.
  • Figure 2: Pipeline of the proposed AIR-CoT. (a) Data Preparation. Training data is augmented with [imagine] tokens to mark spans requiring auditory reasoning. (b) Stage 1: Span Detection. The model is fine-tuned to detect the spans by generating the special tokens during decoding. (c) Stage 2: Knowledge Injection. When encountering the [/imagine] token, the model pauses to generate the embedding using CLAP and injects it for auditory reasoning.
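The two stages described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the span markers appear literally as [imagine] and [/imagine] in decoded text, and it replaces the CLAP encoder with a hypothetical placeholder_audio_embedding function that hashes the span into a small deterministic vector. The <audio_emb_i> slot marker and all function names are likewise illustrative assumptions; in the paper the embedding is injected into the model itself rather than written into the text.

```python
import hashlib
import re
from typing import Callable, List, Tuple

# Assumption: spans are delimited by literal [imagine] ... [/imagine] markers.
IMAGINE_SPAN = re.compile(r"\[imagine\](.*?)\[/imagine\]", re.DOTALL)


def placeholder_audio_embedding(span: str, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for a CLAP encoder (NOT the real model).

    Hashes the span text into `dim` floats in [0, 1]; `dim` must be <= 32
    because a SHA-256 digest has 32 bytes.
    """
    digest = hashlib.sha256(span.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def detect_spans(text: str) -> List[str]:
    """Stage 1 (sketch): collect the spans marked as needing auditory knowledge."""
    return [match.group(1).strip() for match in IMAGINE_SPAN.finditer(text)]


def inject_knowledge(
    text: str,
    embed: Callable[[str], List[float]] = placeholder_audio_embedding,
) -> Tuple[str, List[List[float]]]:
    """Stage 2 (sketch): embed each span where its closing marker occurs.

    Returns the text with each marked span followed by an illustrative
    <audio_emb_i> slot, plus the list of embeddings indexed by i.
    """
    embeddings: List[List[float]] = []

    def _replace(match: "re.Match[str]") -> str:
        span = match.group(1).strip()
        embeddings.append(embed(span))
        return f"{span} <audio_emb_{len(embeddings) - 1}>"

    return IMAGINE_SPAN.sub(_replace, text), embeddings


# Example input with two marked spans (made up for illustration).
example = (
    "Which is higher in pitch: [imagine]a violin[/imagine] "
    "or [imagine]a tuba[/imagine]?"
)
spans = detect_spans(example)
augmented_text, span_embeddings = inject_knowledge(example)
```

Here detect_spans corresponds loosely to the span detection stage and inject_knowledge to the knowledge injection stage. Passing a different embed callable is the point where a real audio-text encoder could be substituted.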