When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models

Chen-An Li, Tzu-Han Lin, Hung-yi Lee

TL;DR

The paper examines how robust large audio-language models (LALMs) are to irrelevant audio during text-only reasoning. It systematically evaluates several open-source LALMs on GSM8K, ARC-Challenge, and MMLU with non-informative audio (silence, Gaussian noise, and environmental sounds from FSD50K) and analyzes the effects of varying duration $d$, amplitude $a$, and decoding temperature $T$, using metrics such as $\text{Accuracy} = n_c/N$ and $\text{Influence Rate} = (n_{ic} + n_{ci})/N$. Irrelevant audio degrades accuracy and increases prediction volatility across tasks; silence and noise often cause similar disruption, and larger models are more resilient but not fully robust. Among the mitigations tested, prompting is largely ineffective, while self-consistency improves stability at a substantial compute cost, which underscores the need for more efficient fusion strategies that preserve reasoning under realistic multimodal inputs.
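The two metrics can be illustrated with a minimal sketch. The summary does not define the symbols, so the reading used here is an assumption: $n_c$ is the number of correct answers under a given audio condition, $n_{ic}$ counts examples that flip from incorrect (text-only baseline) to correct (with audio), and $n_{ci}$ counts flips from correct to incorrect. The function names and the toy data are hypothetical.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Accuracy = n_c / N for one evaluation condition."""
    if not correct:
        raise ValueError("need at least one example")
    return sum(correct) / len(correct)


def influence_rate(baseline: Sequence[bool], with_audio: Sequence[bool]) -> float:
    """Influence Rate = (n_ic + n_ci) / N.

    Assumed reading: the fraction of examples whose correctness changes
    between the text-only baseline and the run with irrelevant audio.
    """
    if len(baseline) != len(with_audio) or not baseline:
        raise ValueError("inputs must be non-empty and the same length")
    n_ic = sum(1 for b, a in zip(baseline, with_audio) if not b and a)
    n_ci = sum(1 for b, a in zip(baseline, with_audio) if b and not a)
    return (n_ic + n_ci) / len(baseline)


# Toy per-example correctness flags (made up for illustration).
baseline = [True, True, False, False, True]
with_audio = [True, False, True, False, True]

acc_baseline = accuracy(baseline)
acc_audio = accuracy(with_audio)
ir = influence_rate(baseline, with_audio)
```

In this toy case both conditions have the same accuracy (3 of 5 correct), yet two of the five predictions changed, which is the kind of volatility that accuracy alone would hide.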

Abstract

Large audio-language models (LALMs) unify speech and text processing, but their robustness in noisy real-world settings remains underexplored. We investigate how irrelevant audio, such as silence, synthetic noise, and environmental sounds, affects text reasoning tasks where audio is unnecessary. Across three text-based benchmarks, we find that even non-informative audio reduces accuracy and increases prediction volatility; the severity of interference scales with longer durations, higher amplitudes, and elevated decoding temperatures. Silence, often assumed neutral, destabilizes outputs as strongly as synthetic noise. While larger models show greater resilience, vulnerabilities persist across all evaluated systems. We further test mitigation strategies and find that prompting shows limited effectiveness, whereas self-consistency improves stability at the cost of increased computation. Our results reveal cross-modal interference as a key robustness challenge and highlight the need for efficient fusion strategies that preserve reasoning performance in the presence of irrelevant inputs.
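Self-consistency, as commonly described, samples several answers at non-zero temperature and returns the most frequent one. The sketch below shows only the voting step and is not the authors' implementation; the function name and the sample answers are hypothetical, and the cost noted in the abstract comes from generating the multiple samples in the first place.

```python
from collections import Counter
from typing import Sequence


def self_consistency_vote(answers: Sequence[str]) -> str:
    """Return the most frequent final answer among sampled generations.

    Answers are compared after stripping surrounding whitespace. How ties
    are broken is left to Counter and should not be relied on.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip() for a in answers)
    return counts.most_common(1)[0][0]


# Hypothetical final answers extracted from five sampled generations.
sampled = ["18", "18", "20", " 18", "20"]
voted = self_consistency_vote(sampled)
```

With k samples per question, inference cost grows roughly k-fold, which is the trade-off between stability and computation that the abstract describes.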

Paper Structure

This paper contains 13 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Illustration of cross-modal interference: irrelevant audio disrupts text-only reasoning in LALMs.
  • Figure 2: Accuracy (Acc) and Influence Rate (IR) of LALMs under cross-modal interference across benchmarks.
  • Figure 3: Impact of varying silence and noise durations (1, 5, 10, 30 sec) and noise amplitudes (-60, -40, -20 dBFS) on GSM8K and ARC-Challenge.
  • Figure 4: Effect of decoding temperature on accuracy and influence rate under audio interference on GSM8K. Non-greedy results are averaged over 3 seeds with standard deviation.
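The durations and amplitudes listed for Figure 3 suggest how the interfering clips could be synthesized. The following is a minimal sketch, not the paper's pipeline: the 16 kHz sample rate, the RMS-based dBFS convention (full scale = 1.0), and all function names are assumptions made for illustration.

```python
import math
import random
from typing import List

SAMPLE_RATE = 16_000  # assumed; the summary does not state a sample rate


def make_silence(duration_s: float, sample_rate: int = SAMPLE_RATE) -> List[float]:
    """All-zero waveform of the requested duration."""
    return [0.0] * int(round(duration_s * sample_rate))


def make_gaussian_noise(duration_s: float, level_dbfs: float,
                        sample_rate: int = SAMPLE_RATE, seed: int = 0) -> List[float]:
    """Gaussian noise rescaled so its RMS matches level_dbfs.

    Assumes dBFS is measured on RMS relative to a full scale of 1.0.
    """
    n = int(round(duration_s * sample_rate))
    if n == 0:
        return []
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, 1.0) for _ in range(n)]
    rms = math.sqrt(sum(s * s for s in samples) / n)
    scale = (10 ** (level_dbfs / 20)) / rms
    # Clamp to [-1, 1]; at the levels used here clipping is not expected.
    return [max(-1.0, min(1.0, s * scale)) for s in samples]


def rms_dbfs(samples: List[float]) -> float:
    """RMS level in dBFS; digital silence maps to negative infinity."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return -math.inf if rms == 0 else 20 * math.log10(rms)


silence_1s = make_silence(1)
noise_1s = make_gaussian_noise(1, level_dbfs=-40)
```

Sweeping `duration_s` over 1, 5, 10, and 30 seconds and `level_dbfs` over -60, -40, and -20 would reproduce the grid of conditions named in the caption, under the assumptions stated above.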