When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration

Jayadev Billa

TL;DR

This work introduces ALME, a multilingual benchmark that measures modality arbitration in audio-LLMs by analyzing how models resolve conflicts between audio input and accompanying text. It formalizes the Text Dominance Ratio (TDR) and reveals a roughly 10x gap: models follow conflicting text about ten times more often under audio-text conflict than under text-text conflict with identical reliability cues (16.6% versus 1.6% for Gemini 2.0 Flash), even though audio preserves more information than ASR transcripts. The study demonstrates cross-model and cross-linguistic variation, shows that prompt framing can reduce text dominance (and that forced transcription increases it), and provides interventional evidence via fine-tuning that the arbitration locus largely resides in the LLM's reasoning rather than the audio encoder. These findings have practical implications for model selection, per-language evaluation, and robustness, and they establish a framework for reasoning about modality arbitration beyond traditional transcription-focused benchmarks.

Abstract

When audio and text conflict, speech-enabled language models follow the text 10 times more often than when arbitrating between two text sources, even when explicitly instructed to trust the audio. Using ALME, a benchmark of 57,602 controlled audio-text conflict stimuli across 8 languages, we find that Gemini 2.0 Flash exhibits 16.6% text dominance under audio-text conflict versus 1.6% under text-text conflict with identical reliability cues. This gap is not explained by audio quality: audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%), indicating audio embeddings preserve more information than text transcripts. We propose that text dominance reflects an asymmetry not in information content but in arbitration accessibility: how easily the model can reason over competing representations. This framework explains otherwise puzzling findings. Forcing transcription before answering increases text dominance (19% to 33%), sacrificing audio's information advantage without improving accessibility. Framing text as "deliberately corrupted" reduces text dominance by 80%. A fine-tuning ablation provides interventional evidence: training only the audio projection layer increases text dominance (+26.5%), while LoRA on the language model halves it (-23.9%), localizing text dominance to the LLM's reasoning rather than the audio encoder. Experiments across four state-of-the-art audio-LLMs and 8 languages show consistent trends with substantial cross-linguistic and cross-model variation, establishing modality arbitration as a distinct reliability dimension not captured by standard speech benchmarks.
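
The headline "10 times" figure and the audio-versus-cascade margin can be checked directly from the numbers reported in the abstract. The short sketch below does only that arithmetic; the function and variable names are illustrative and are not taken from the paper.

```python
def gap_ratio(tdr_audio_text: float, tdr_text_text: float) -> float:
    """Ratio of text dominance under audio-text conflict to text-text conflict."""
    return tdr_audio_text / tdr_text_text


# Figures reported in the abstract for Gemini 2.0 Flash (percent).
ratio = gap_ratio(16.6, 1.6)

# Audio-only accuracy minus cascade accuracy, in percentage points.
info_margin = round(97.2 - 93.9, 1)

print(f"text dominance gap: {ratio:.1f}x")
print(f"audio-only advantage over cascade: {info_margin} points")
```

The ratio comes out slightly above 10, which matches the abstract's rounded "10 times" claim, and the positive accuracy margin is what the abstract uses to argue that the gap is not explained by audio quality.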

Paper Structure (44 sections, 1 equation, 6 figures, 20 tables)

This paper contains 44 sections, 1 equation, 6 figures, 20 tables.

Figures (6)

  • Figure 1: The four evaluation conditions. In audio-only, text-only, and aligned conditions, there is a clear correct answer. In the conflict condition (bottom, highlighted), audio and text disagree, and the model's choice reveals its modality preference. If it answers "three" (audio), it followed audio; if "five" (text), it followed text. TDR measures how often models follow text.
  • Figure 2: Stimulus generation pipeline. Natural speech from Common Voice is filtered, analyzed for flippable semantic elements, paired with LLM-generated questions, and validated.
  • Figure 3: Prompt templates for aligned and conflict conditions. Gray: shared elements. Cyan: aligned-only framing. Orange: conflict-only (transcript flagged as potentially incorrect, with explicit instruction to follow audio). Green: forced-choice output constraint, identical across all conditions. The system prompt (top) and per-message suffix (bottom of each condition) jointly enforce verbatim selection from the provided choices.
  • Figure 4: Prompt intervention variants for testing whether text dominance can be reduced through instruction design alone. All use the same audio and conflict text; only the framing of the conflict transcript differs.
  • Figure 5: Text Dominance Ratio across four audio-LLMs ($n$=57,602). TDR ranges from 16.6% (Gemini) to 63.2% (Qwen2-Audio). Horizontal line indicates no modality preference (TDR=50%).
  • ...and 1 more figures
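
As Figure 1 describes, each conflict trial can be scored by checking whether the model's forced-choice answer matches the audio content or the conflicting text. The sketch below is a minimal illustration under stated assumptions: the function name, the input format, and the choice to exclude answers matching neither source from the denominator are all hypothetical, and the paper's formal TDR definition may differ.

```python
from collections import Counter


def text_dominance_ratio(responses, audio_answers, text_answers):
    """Share of decided conflict trials in which the model followed the text.

    Assumption (not from the paper): answers matching neither source are
    counted as "other" and left out of the denominator.
    """
    counts = Counter()
    for resp, audio, text in zip(responses, audio_answers, text_answers):
        r = resp.strip().lower()
        if r == text.strip().lower():
            counts["text"] += 1
        elif r == audio.strip().lower():
            counts["audio"] += 1
        else:
            counts["other"] += 1
    decided = counts["text"] + counts["audio"]
    tdr = counts["text"] / decided if decided else float("nan")
    return tdr, counts


# Toy data echoing Figure 1: audio says "three", conflicting text says "five".
responses = ["three", "five", "three", "Three"]
tdr, counts = text_dominance_ratio(responses, ["three"] * 4, ["five"] * 4)
print(f"TDR = {tdr:.2f}", dict(counts))
```

Under this convention a TDR of 0.5 corresponds to no modality preference, consistent with the reference line described for Figure 5.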