When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration
Jayadev Billa
TL;DR
This work introduces ALME, a multilingual benchmark for measuring modality arbitration in audio-LLMs: how models resolve conflicts between audio input and text input. It formalizes the Text Dominance Ratio (TDR) and reports a roughly 10x gap: models follow the text far more often under audio-text conflict than under text-text conflict with identical reliability cues, even though audio-only accuracy exceeds the accuracy of an ASR cascade. The study documents cross-model and cross-linguistic variation, shows that prompt framing can reduce text dominance, and provides interventional evidence via fine-tuning that text dominance is localized to the LLM's reasoning rather than the audio encoder. These findings bear on model selection, per-language evaluation, and robustness, and they establish a framework for reasoning about modality arbitration beyond traditional transcription-focused benchmarks.
Abstract
When audio and text conflict, speech-enabled language models follow the text 10 times more often than when arbitrating between two text sources, even when explicitly instructed to trust the audio. Using ALME, a benchmark of 57,602 controlled audio-text conflict stimuli across 8 languages, we find that Gemini 2.0 Flash exhibits 16.6\% text dominance under audio-text conflict versus 1.6\% under text-text conflict with identical reliability cues. This gap is not explained by audio quality: audio-only accuracy (97.2\%) exceeds cascade accuracy (93.9\%), indicating audio embeddings preserve more information than text transcripts. We propose that text dominance reflects an asymmetry not in information content but in arbitration accessibility: how easily the model can reason over competing representations. This framework explains otherwise puzzling findings. Forcing transcription before answering increases text dominance (19\% to 33\%), sacrificing audio's information advantage without improving accessibility. Framing text as ``deliberately corrupted'' reduces text dominance by 80\%. A fine-tuning ablation provides interventional evidence: training only the audio projection layer increases text dominance (+26.5\%), while LoRA on the language model halves it ($-$23.9\%), localizing text dominance to the LLM's reasoning rather than the audio encoder. Experiments across four state-of-the-art audio-LLMs and 8 languages show consistent trends with substantial cross-linguistic and cross-model variation, establishing modality arbitration as a distinct reliability dimension not captured by standard speech benchmarks.
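As a rough illustration of the arithmetic behind the headline numbers, the sketch below computes a text dominance rate over a handful of conflict trials and the ratio between the two reported rates (16.6\% and 1.6\%). It assumes that text dominance is the fraction of conflict trials in which the model's answer matches the text-side answer; the paper's formal TDR definition may differ. The `Trial` record, its field names, and the toy trials are hypothetical and are not taken from ALME.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One conflict stimulus (hypothetical record layout, not the ALME schema)."""
    text_answer: str   # answer implied by the text source
    other_answer: str  # answer implied by the competing source (audio or second text)
    model_answer: str  # answer the model actually gave


def text_dominance_rate(trials):
    """Fraction of conflict trials in which the model sided with the text.

    Assumed definition for illustration only.
    """
    if not trials:
        raise ValueError("need at least one trial")
    followed_text = sum(t.model_answer == t.text_answer for t in trials)
    return followed_text / len(trials)


def dominance_gap(audio_text_rate, text_text_rate):
    """Ratio of text dominance under audio-text vs. text-text conflict."""
    if text_text_rate <= 0:
        raise ValueError("text-text rate must be positive")
    return audio_text_rate / text_text_rate


# Toy trials: the model follows the text in one of four conflicts.
toy_trials = [
    Trial("paris", "rome", "paris"),
    Trial("cat", "dog", "dog"),
    Trial("red", "blue", "blue"),
    Trial("7", "9", "9"),
]
toy_rate = text_dominance_rate(toy_trials)

# Rates reported in the abstract for Gemini 2.0 Flash.
reported_gap = dominance_gap(0.166, 0.016)

if __name__ == "__main__":
    print(f"toy text dominance rate: {toy_rate:.2f}")
    print(f"reported gap: {reported_gap:.1f}x")
```

Under this assumed definition, the ratio of the two reported rates is about 10.4, consistent with the "10 times" figure stated in the abstract.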
