A Sensitivity Analysis of Multi-Event Audio Grounding in Audio LLMs

Taehan Lee, Jaehan Jung, Hyukjun Lee

TL;DR

We present a large-scale evaluation of event grounding and false alarms in Audio LLMs as auditory scene complexity increases. Models become more uncertain on multi-event audio, which leaves room for improvement.

Abstract

Audio LLMs have shown a strong ability to understand audio samples, yet their reliability in complex acoustic scenes remains under-explored. Unlike prior work limited to small scale or less controlled query construction, we present a large-scale evaluation of event grounding and false alarms as auditory scene complexity increases. Using 71K AudioCapsV2 clips, we extract normalized (source, attribute) events and build two query types: present-event queries for ground-truth detection and absent-event queries to probe hallucinations, using similarity-filtered negative sampling in an audio-aligned text embedding space. We evaluate four SOTA Audio LLMs with 12 prompt variants over 500K yes/no queries per model. Across models, increasing event count consistently lowers true-positive rate and raises false-positive rate, while prompts induce a strong trade-off between the two. Our confidence analysis shows that models become more uncertain on multi-event audio, revealing room for improvement.
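
The evaluation described above reduces to two rates tracked against the number of events in a clip: the true-positive rate (share of present-event queries answered "yes") and the false-positive rate (share of absent-event queries answered "yes"). The following is a minimal sketch of that bookkeeping, not the authors' code. The record layout `(n_events, query_type, answer)`, the function name `rates_by_event_count`, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict


def rates_by_event_count(records):
    """Compute TPR and FPR per event count.

    records: iterable of (n_events, query_type, answer) tuples, where
    query_type is "present" or "absent" and answer is "yes" or "no".
    This layout is hypothetical; it is not taken from the paper.
    """
    # For each event count, keep [yes_count, total] per query type.
    counts = defaultdict(lambda: {"present": [0, 0], "absent": [0, 0]})
    for n_events, query_type, answer in records:
        cell = counts[n_events][query_type]
        cell[0] += int(answer == "yes")
        cell[1] += 1

    result = {}
    for n_events in sorted(counts):
        p_yes, p_total = counts[n_events]["present"]
        a_yes, a_total = counts[n_events]["absent"]
        result[n_events] = {
            # TPR: "yes" answers on queries about events that are in the clip.
            "tpr": p_yes / p_total if p_total else None,
            # FPR: "yes" answers on queries about events that are not in the clip.
            "fpr": a_yes / a_total if a_total else None,
        }
    return result


# Toy records (made up), only to show the call shape.
records = [
    (1, "present", "yes"), (1, "present", "yes"),
    (1, "absent", "no"), (1, "absent", "no"),
    (3, "present", "yes"), (3, "present", "no"),
    (3, "absent", "yes"), (3, "absent", "no"),
]
rates = rates_by_event_count(records)
```

Grouping by event count before dividing keeps the two rates comparable across scene complexities, since the number of present-event and absent-event queries may differ per group.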

Paper Structure (12 sections, 3 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overall pipeline of audio event extraction.
  • Figure 2: $\operatorname{erank}$ distribution of audio embeddings from EAT (left) and SSLAM (right).
  • Figure 3: Visualization of the event embedding space on principal component axes. Blue dots indicate all other events.
  • Figure 4: True Positive Rate (green) on present-event detection and False Positive Rate (red) on absent-event detection, plotted against the number of events across audio LLMs. Solid (-), dashed (--), and dotted (..) lines indicate the average across prompts, the prompt with the highest TPR / lowest FPR, and the prompt with the lowest TPR / highest FPR, respectively.
  • Figure 5: Normalized output token probability distribution of Audio LLMs. Yes / No scores are on the left / right axes, respectively.
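
One plausible reading of the "normalized output token probability" in Figure 5 is a renormalization over the two answer tokens, so that the yes and no scores sum to one. The sketch below assumes exactly that; this excerpt does not state the paper's normalization, and the function name `normalized_yes_no` and its logit inputs are hypothetical.

```python
import math


def normalized_yes_no(logit_yes, logit_no):
    """Renormalize two answer-token logits into probabilities that sum to 1.

    Assumption: confidence is a two-way softmax over the "yes" and "no"
    tokens only. The paper's exact procedure may differ.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    total = e_yes + e_no
    return e_yes / total, e_no / total


p_yes, p_no = normalized_yes_no(2.0, 2.0)  # equal logits: maximally uncertain
```

Under this reading, scores near 0.5 indicate an uncertain model, while scores near 0 or 1 indicate a confident answer.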