When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration

Jayadev Billa

TL;DR

This work introduces ALME, a multilingual benchmark that measures modality arbitration in audio-LLMs by analyzing how models resolve conflicts between the audio and the text in a prompt. It formalizes the Text Dominance Ratio (TDR) and reveals a robust 10x gap: with identical reliability cues, models follow the text about 10 times more often under audio-text conflict than under text-text conflict, even though audio preserves more information than ASR transcripts. The study documents cross-model and cross-linguistic variation, shows that some prompting strategies can reduce text dominance, and provides interventional evidence via fine-tuning that text dominance is localized largely in the LLM's reasoning rather than in the audio encoder. These findings have practical implications for model selection, per-language evaluation, and robustness, and they establish a framework for reasoning about modality arbitration beyond traditional transcription-focused benchmarks.

Abstract

When audio and text conflict, speech-enabled language models follow the text 10 times more often than when arbitrating between two text sources, even when explicitly instructed to trust the audio. Using ALME, a benchmark of 57,602 controlled audio-text conflict stimuli across 8 languages, we find that Gemini 2.0 Flash exhibits 16.6% text dominance under audio-text conflict versus 1.6% under text-text conflict with identical reliability cues. This gap is not explained by audio quality: audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%), indicating audio embeddings preserve more information than text transcripts. We propose that text dominance reflects an asymmetry not in information content but in arbitration accessibility: how easily the model can reason over competing representations. This framework explains otherwise puzzling findings. Forcing transcription before answering increases text dominance (19% to 33%), sacrificing audio's information advantage without improving accessibility. Framing text as "deliberately corrupted" reduces text dominance by 80%. A fine-tuning ablation provides interventional evidence: training only the audio projection layer increases text dominance (+26.5%), while LoRA on the language model halves it (-23.9%), localizing text dominance to the LLM's reasoning rather than the audio encoder. Experiments across four state-of-the-art audio-LLMs and 8 languages show consistent trends with substantial cross-linguistic and cross-model variation, establishing modality arbitration as a distinct reliability dimension not captured by standard speech benchmarks.
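The headline "10 times" figure can be checked directly from the percentages quoted in the abstract. The short sketch below only redoes that arithmetic; the variable names are illustrative and do not come from the paper.

```python
# Percentages quoted in the abstract for Gemini 2.0 Flash.
tdr_audio_text = 16.6   # text dominance under audio-text conflict
tdr_text_text = 1.6     # text dominance under text-text conflict
audio_only_acc = 97.2   # accuracy when the model hears only the audio
cascade_acc = 93.9      # accuracy of the cascade baseline

# How many times more often the text wins when the competing source is audio.
gap_ratio = tdr_audio_text / tdr_text_text

# Accuracy advantage of audio over the cascade, in percentage points.
audio_advantage = round(audio_only_acc - cascade_acc, 1)

print(f"text-dominance gap: {gap_ratio:.1f}x")
print(f"audio-only advantage: {audio_advantage} points")
```

The ratio comes out slightly above 10, which matches the rounded "10 times" claim, while the audio-only condition is about 3 percentage points more accurate than the cascade.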

Paper Structure (44 sections, 1 equation, 6 figures, 20 tables)

This paper contains 44 sections, 1 equation, 6 figures, 20 tables.

Introduction
Related Work
Audio-LLM Evaluation
Audio-Text Conflict
Modality Bias in Multimodal Models
Methodology
Experimental Design
Evaluation Conditions
Stimulus Generation
Dataset
Stimulus Validation
Models Evaluated
Evaluation Protocol
Control Conditions
Cascade Baseline
...and 29 more sections

Figures (6)

Figure 1: The four evaluation conditions. In audio-only, text-only, and aligned conditions, there is a clear correct answer. In the conflict condition (bottom, highlighted), audio and text disagree, and the model's choice reveals its modality preference. If it answers "three" (audio), it followed audio; if "five" (text), it followed text. TDR measures how often models follow text.
Figure 2: Stimulus generation pipeline. Natural speech from Common Voice is filtered, analyzed for flippable semantic elements, paired with LLM-generated questions, and validated.
Figure 3: Prompt templates for aligned and conflict conditions. Gray: shared elements. Cyan: aligned-only framing. Orange: conflict-only (transcript flagged as potentially incorrect, with explicit instruction to follow audio). Green: forced-choice output constraint, identical across all conditions. The system prompt (top) and per-message suffix (bottom of each condition) jointly enforce verbatim selection from the provided choices.
Figure 4: Prompt intervention variants for testing whether text dominance can be reduced through instruction design alone. All use the same audio and conflict text; only the framing of the conflict transcript differs.
Figure 5: Text Dominance Ratio across four audio-LLMs (n = 57,602). TDR ranges from 16.6% (Gemini) to 63.2% (Qwen2-Audio). Horizontal line indicates no modality preference (TDR = 50%).
...and 1 more figure
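The scoring logic described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the denominator is an assumption (only trials where the model followed one of the two modalities are counted), chosen so that 50% corresponds to no modality preference as in Figure 5. The paper's own equation may define TDR differently.

```python
from collections import Counter

def classify(response: str, audio_answer: str, text_answer: str) -> str:
    """Label a conflict-trial response by the modality it agrees with."""
    r = response.strip().lower()
    if r == audio_answer.lower():
        return "audio"
    if r == text_answer.lower():
        return "text"
    return "other"  # response matches neither source

def text_dominance_ratio(trials) -> float:
    """Share of text-following responses among trials that followed either modality.

    Assumption: 'other' responses are excluded from the denominator.
    """
    counts = Counter(classify(*trial) for trial in trials)
    followed = counts["text"] + counts["audio"]
    if followed == 0:
        raise ValueError("no trial followed either modality")
    return counts["text"] / followed

# Toy trials as (model response, audio answer, text answer), using the
# "three" vs "five" example from the Figure 1 caption.
trials = [
    ("three", "three", "five"),   # followed audio
    ("five", "three", "five"),    # followed text
    ("Three ", "three", "five"),  # followed audio (case and whitespace ignored)
    ("three", "three", "five"),   # followed audio
]
tdr = text_dominance_ratio(trials)
print(f"TDR = {tdr:.0%}")
```

Exact string matching is enough here because, per the Figure 3 caption, the prompts constrain the model to select verbatim from the provided choices.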
