LALM-as-a-Judge: Benchmarking Large Audio-Language Models for Safety Evaluation in Multi-Turn Spoken Dialogues
Amir Ivry, Shinji Watanabe
TL;DR
This work presents LALM-as-a-Judge, a controlled benchmark for evaluating large audio-language models (LALMs) as safety judges in multi-turn spoken dialogues. It constructs 24,000 unsafe dialogue variants, each created by replacing a single turn, and validates their perceived safety with a human anchor study, so that judgments can be attributed to localized harms at a fine grain. The study benchmarks three open-source LALMs (Qwen2-Audio, Audio Flamingo 3, MERaLiON) and a text baseline (LLaMA) across audio-only, transcription-only, and multimodal inputs, analyzing sensitivity, specificity, and position bias under varied prompting strategies. Key findings reveal strong architecture- and modality-dependent trade-offs: higher sensitivity often comes with lower stability across turns, while audio cues can improve severity ordering for specific categories. Transcription quality (e.g., Whisper) can bottleneck detection, making end-to-end evaluation essential. The authors provide actionable guidelines and a practitioner flowchart for deploying audio-aware safety judges in spoken-dialogue systems, and they highlight ethical considerations and the need for human oversight alongside automated judges.
Abstract
Spoken dialogues with and between voice agents are becoming increasingly common, yet assessing them for socially harmful content such as violence, harassment, and hate remains text-centric and fails to account for audio-specific cues and transcription errors. We present LALM-as-a-Judge, the first controlled benchmark and systematic study of large audio-language models (LALMs) as safety judges for multi-turn spoken dialogues. We generate 24,000 synthetic unsafe spoken dialogues in English, each 3-10 turns long, in which a single turn contains harmful content from one of 8 categories (e.g., violence) at one of 5 severity grades, from very mild to severe. On 160 dialogues, 5 human raters confirmed reliable unsafe detection and a meaningful severity scale. We benchmark three open-source LALMs (Qwen2-Audio, Audio Flamingo 3, and MERaLiON) as zero-shot judges that output a scalar safety score in [0,1] from audio-only, transcription-only, or multimodal inputs, along with a transcription-only LLaMA baseline. We measure each judge's sensitivity in detecting unsafe content, its specificity in ordering severity levels, and the stability of its score across dialogue turns. Results reveal architecture- and modality-dependent trade-offs: the most sensitive judge is also the least stable across turns, while stable configurations sacrifice detection of mild harmful content. Transcription quality is a key bottleneck: Whisper-Large may significantly reduce sensitivity for transcription-only modes, while largely preserving severity ordering. Audio becomes crucial when paralinguistic cues or transcription fidelity are category-critical. We summarize all findings and provide actionable guidance for practitioners.
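To make the three measured quantities concrete, the following is a minimal illustrative sketch of how sensitivity, severity ordering, and positional stability could be computed from scalar judge scores. It is not the authors' implementation: the abstract does not give the exact metric definitions, so the function names, the 0.5 threshold, the adjacent-grade ordering check, and the convention that a higher score means more unsafe are all assumptions made here for illustration.

```python
from statistics import mean, pstdev

# Assumption: scores lie in [0, 1] and a HIGHER score means MORE unsafe.
# Flip the comparisons if the judge uses the opposite convention.


def sensitivity(unsafe_scores, threshold=0.5):
    """Fraction of unsafe dialogues flagged at a (hypothetical) threshold."""
    flags = [1.0 if s >= threshold else 0.0 for s in unsafe_scores]
    return mean(flags)


def severity_ordering(scores_by_grade):
    """Fraction of adjacent severity grades whose mean score increases.

    scores_by_grade maps a grade (1 = very mild ... 5 = severe) to the
    list of judge scores for dialogues at that grade.
    """
    grades = sorted(scores_by_grade)
    means = [mean(scores_by_grade[g]) for g in grades]
    ordered = [1.0 if lo < hi else 0.0 for lo, hi in zip(means, means[1:])]
    return mean(ordered)


def position_instability(scores_by_turn):
    """Spread of the mean score across positions of the unsafe turn.

    scores_by_turn maps the index of the unsafe turn to the scores of
    dialogues whose unsafe turn sits at that index. 0.0 means the judge's
    average score does not depend on where the unsafe turn appears.
    """
    per_position = [mean(v) for v in scores_by_turn.values()]
    return pstdev(per_position)


# Toy scores, invented purely to exercise the functions.
toy_unsafe = [0.9, 0.2, 0.7, 0.6]
toy_grades = {1: [0.2, 0.3], 2: [0.4, 0.4], 3: [0.35, 0.35], 4: [0.7], 5: [0.9]}
toy_turns = {1: [0.5, 0.5], 2: [0.5], 3: [0.5]}

print(sensitivity(toy_unsafe))
print(severity_ordering(toy_grades))
print(position_instability(toy_turns))
```

Under these assumptions, a judge that flags nearly everything would score high on sensitivity while possibly showing a large positional spread, which is the kind of trade-off the abstract reports.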
