LALM-as-a-Judge: Benchmarking Large Audio-Language Models for Safety Evaluation in Multi-Turn Spoken Dialogues

Amir Ivry, Shinji Watanabe

TL;DR

This work presents LALM-as-a-Judge, a controlled benchmark to evaluate large audio-language models as safety judges in multi-turn spoken dialogues. It constructs 24,000 unsafe dialogue variants by replacing a single turn and validates safety perceptions with a human anchor study, enabling fine-grained attribution of judgments to localized harms. The study benchmarks three open-source LALMs (Qwen2-Audio, Audio Flamingo 3, MERaLiON) and a text baseline (LLaMA) across audio-only, transcription-only, and multimodal inputs, analyzing sensitivity, specificity, and position bias under varied prompting strategies. Key findings reveal strong architecture- and modality-dependent trade-offs: higher sensitivity often entails lower stability across turns, while audio cues can improve severity ordering in category-specific cases; transcription quality (e.g., Whisper) can bottleneck detection, making end-to-end evaluation essential. The authors provide actionable guidelines and a practitioner flowchart to help implement safe, audio-aware safety judgments in deployed spoken-dialogue systems, highlighting ethical considerations and the need for human oversight alongside automated judges.

Abstract

Spoken dialogues with and between voice agents are becoming increasingly common, yet assessing them for their socially harmful content such as violence, harassment, and hate remains text-centric and fails to account for audio-specific cues and transcription errors. We present LALM-as-a-Judge, the first controlled benchmark and systematic study of large audio-language models (LALMs) as safety judges for multi-turn spoken dialogues. We generate 24,000 unsafe and synthetic spoken dialogues in English that consist of 3-10 turns, by having a single dialogue turn including content with one of 8 harmful categories (e.g., violence) and on one of 5 grades, from very mild to severe. On 160 dialogues, 5 human raters confirmed reliable unsafe detection and a meaningful severity scale. We benchmark three open-source LALMs: Qwen2-Audio, Audio Flamingo 3, and MERaLiON as zero-shot judges that output a scalar safety score in [0,1] across audio-only, transcription-only, or multimodal inputs, along with a transcription-only LLaMA baseline. We measure the judges' sensitivity to detecting unsafe content, the specificity in ordering severity levels, and the stability of the score in dialogue turns. Results reveal architecture- and modality-dependent trade-offs: the most sensitive judge is also the least stable across turns, while stable configurations sacrifice detection of mild harmful content. Transcription quality is a key bottleneck: Whisper-Large may significantly reduce sensitivity for transcription-only modes, while largely preserving severity ordering. Audio becomes crucial when paralinguistic cues or transcription fidelity are category-critical. We summarize all findings and provide actionable guidance for practitioners.

Paper Structure (54 sections, 5 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Controlled generation of unsafe spoken dialogue variants and safety evaluation pipeline. Starting from a safe multi-turn spoken dialogue, a single target turn is selected and replaced in a controlled manner: GPT-4o generates a revised transcription and emotion label conditioned on the full dialogue context, an explicit unsafe category, and a predefined severity level, while all other turns remain unchanged. The revised turn is synthesized into speech using speaker-conditioned TTS and reintegrated into the original dialogue, yielding an unsafe variant that differs in exactly one turn. Each resulting dialogue is then evaluated by a Large Audio-Language Model (LALM) acting as a safety judge, which produces a scalar safety score in $\left[0,1\right]$, under multiple transcription sources and input modalities. The design enables fine-grained attribution of safety judgments to localized unsafe content while controlling broader conversational context.
  • Figure 2: Severity-response profiles of safety scores under two prompt-selected operating points: the top row uses a cross-validation (CV)-selected strategy that maximizes $\mathrm{Sensitivity}$, while the bottom row uses one that maximizes $\mathrm{Specificity}$, independently per configuration. Curves show the mean drop in safety score from the safe baseline (defined in Eq. \ref{eq:delta}) with 95% confidence intervals as severity increases, comparing audio-only, transcription-only, and multimodal inputs and swapping ground-truth vs. Whisper transcriptions. A key takeaway is that the shape of the curves distinguishes a judge that detects unsafe content but compresses severity differences from one that shows little movement at mild severities; these behaviors are attributed to $\mathrm{Sensitivity}$- and $\mathrm{Specificity}$-optimization, respectively.
  • Figure 3: Category-resolved $\mathrm{Sens}$ (top) and $\mathrm{Spec}$ (bottom) for each configuration, decomposed by modality and transcription source. The figure highlights that "average" performance can hide strong slice effects: the same configuration can be well-behaved for some policy categories yet exhibit weak separation or non-monotone severity structure in others. A practical takeaway is that configuration choices should be audited at the category level when policy coverage matters, rather than relying only on pooled averages.
  • Figure 4: Position bias (PB) as a function of dialogue length (3-10 turns) for both $\mathrm{Sens}$ (top row) and $\mathrm{Spec}$ (bottom row), with the unsafe turn at position $k=\ell$ in $\ell$-turn dialogues. Each subplot shows how stability changes when the unsafe turn is placed last in progressively longer dialogue contexts, across modalities and transcription sources. The key takeaway is that stability is not a constant property of a judge: some configurations remain flat across lengths, while others exhibit length-amplified bias, diagnosing context or recency effects that become more pronounced in longer or multimodal settings.
  • Figure 5: Pareto frontiers trading off score quality against positional stability. Each point corresponds to a full evaluation configuration, plotted as (a) $\mathrm{Sens}$ vs. its Average Absolute PB (AAPB$_\text{sens}$) and (b) $\mathrm{Spec}$ vs. AAPB$_\text{spec}$. AAPB of a metric averages its PB magnitude across dialogue lengths. The dashed curve marks non-dominated choices, making the "best" configuration conditional on a practitioner’s stability budget: improvements in detection or ordering often come with increased position dependence, while some gains are achievable primarily by switching modality or transcription source rather than changing the judge itself.
  • ...and 5 more figures
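
The Figure 5 caption describes two computations that are easy to sketch: AAPB, which averages the magnitude of PB across dialogue lengths, and the selection of non-dominated configurations. The following is a minimal illustration rather than the authors' code. It assumes PB values per dialogue length are already available, since the PB formula itself is not given in this excerpt. The function names, configuration names, and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def average_absolute_pb(pb_by_length: Dict[int, float]) -> float:
    """Average |PB| across dialogue lengths, following the caption's description of AAPB."""
    if not pb_by_length:
        raise ValueError("need PB for at least one dialogue length")
    return sum(abs(v) for v in pb_by_length.values()) / len(pb_by_length)


def pareto_front(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the names of non-dominated configurations.

    Each value is (quality, aapb). Quality (e.g. Sens or Spec) is maximized
    and AAPB is minimized. A point is dominated if another point is at least
    as good on both axes and strictly better on one.
    """
    front = []
    for name, (q, b) in points.items():
        dominated = any(
            (q2 >= q and b2 <= b) and (q2 > q or b2 < b)
            for other, (q2, b2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    # Order the frontier from most stable to least stable.
    return sorted(front, key=lambda n: points[n][1])


# Hypothetical configurations: (quality score, AAPB from made-up PB values
# keyed by dialogue length). These are not results from the paper.
configs = {
    "judge_a_audio": (0.90, average_absolute_pb({3: 0.10, 5: -0.20, 10: 0.30})),
    "judge_b_text": (0.70, average_absolute_pb({3: 0.02, 5: -0.04, 10: 0.06})),
    "judge_c_multimodal": (0.65, average_absolute_pb({3: 0.10, 5: 0.10, 10: 0.10})),
}
frontier = pareto_front(configs)
print(frontier)
```

In this toy input, `judge_c_multimodal` is dominated because `judge_b_text` has both higher quality and lower AAPB. The other two remain on the frontier, so the choice between them depends on how much position dependence a practitioner is willing to accept.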