MedVoiceBias: A Controlled Study of Audio LLM Behavior in Clinical Decision-Making
Zhi Rui Tam, Yun-Nung Chen
TL;DR
MedVoiceBias investigates how paralinguistic cues in spoken input bias the clinical decisions of audio LLMs, focusing on binary surgical recommendations. Using a controlled dataset of 170 clinical cases and 36 synthesized voice profiles, the study quantifies modality and demographic effects by comparing audio inputs against text baselines under Direct Answer and Chain-of-Thought prompting. Key findings are a substantial modality bias (up to about 35 percentage points) and persistent age-related disparities. Gender bias can be mitigated with explicit reasoning, while no effect of emotional cues was detected, which the study attributes to the models' weak emotion recognition. The work highlights critical risks of deploying audio-enabled medical AI and advocates bias-aware architectures so that decisions reflect clinical evidence rather than voice characteristics.
Abstract
As large language models transition from text-based interfaces to audio interactions in clinical settings, they might introduce new vulnerabilities through paralinguistic cues in audio. We evaluated these models on 170 clinical cases, each synthesized into speech from 36 distinct voice profiles spanning variations in age, gender, and emotion. Our findings reveal a severe modality bias: surgical recommendations for audio inputs varied by as much as 35% compared to identical text-based inputs, with one model providing 80% fewer recommendations. Further analysis uncovered age disparities of up to 12% between young and elderly voices, which persisted in most models despite chain-of-thought prompting. While explicit reasoning successfully eliminated gender bias, no impact of emotion was detected, owing to the models' poor emotion recognition performance. These results demonstrate that audio LLMs are susceptible to making clinical decisions based on a patient's voice characteristics rather than medical evidence, a flaw that risks perpetuating healthcare disparities. We conclude that bias-aware architectures are urgently needed before the clinical deployment of these models.
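
The abstract describes gaps in surgical recommendations both as a difference between conditions and as a relative reduction ("fewer recommendations"). The sketch below shows one way such quantities could be computed from binary decisions. It is an illustration under stated assumptions, not the authors' evaluation code: the function names are hypothetical, the abstract does not specify the exact formula, and the toy decision lists are invented for the example rather than taken from the study.

```python
from typing import Sequence


def recommendation_rate(decisions: Sequence[bool]) -> float:
    """Fraction of cases in which surgery was recommended (True)."""
    if not decisions:
        raise ValueError("decisions must be non-empty")
    return sum(decisions) / len(decisions)


def gap_in_points(condition: Sequence[bool], baseline: Sequence[bool]) -> float:
    """Absolute gap: condition rate minus baseline rate, in percentage points."""
    return 100.0 * (recommendation_rate(condition) - recommendation_rate(baseline))


def relative_change(condition: Sequence[bool], baseline: Sequence[bool]) -> float:
    """Relative gap: change in rate as a percentage of the baseline rate."""
    base = recommendation_rate(baseline)
    if base == 0:
        raise ValueError("baseline rate is zero; relative change is undefined")
    return 100.0 * (recommendation_rate(condition) - base) / base


# Toy data (made up for illustration): the same 8 cases answered from text
# and from audio. True means the model recommended surgery.
text_decisions = [True, True, True, True, False, False, False, False]
audio_decisions = [True, True, True, False, False, False, False, False]

print(gap_in_points(audio_decisions, text_decisions))    # -12.5
print(relative_change(audio_decisions, text_decisions))  # -25.0
```

The same helpers could compare any two groups of decisions, for example young versus elderly voice profiles on identical cases. The toy output also shows why the two measures should not be conflated: a drop from 50% to 37.5% is 12.5 percentage points but a 25% relative reduction.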
