Efficient LLM Safety Evaluation through Multi-Agent Debate

Dachuan Lin, Guobin Shen, Zihao Yang, Tianrong Liu, Dongcheng Zhao, Yi Zeng

TL;DR

This paper tackles scalable safety evaluation for LLMs by combining a large, human-annotated jailbreak benchmark (HAJailBench) with a cost-efficient Multi-Agent Judge framework that uses structured debates among critic, defender, and judge roles implemented by Small Language Models. The approach employs a pre-debate value-alignment step and iterative, role-conditioned discourse to surface semantic safety issues, achieving near frontier-model reliability while reducing inference costs by a substantial margin. Key findings include that three debate rounds optimally balance accuracy and efficiency, and that HAJailBench provides a robust ground truth for evaluating judge reliability and safety performance. Together, the dataset and framework offer a reproducible, interpretable, and scalable pathway for LLM safety assessment in real-world, cost-sensitive settings.

Abstract

Safety evaluation of large language models (LLMs) increasingly relies on LLM-as-a-Judge frameworks, but the high cost of frontier models limits scalability. We propose a cost-efficient multi-agent judging framework that employs Small Language Models (SLMs) through structured debates among critic, defender, and judge agents. To rigorously assess safety judgments, we construct HAJailBench, a large-scale human-annotated jailbreak benchmark comprising 12,000 adversarial interactions across diverse attack methods and target models. The dataset provides fine-grained, expert-labeled ground truth for evaluating both safety robustness and judge reliability. Our SLM-based framework achieves agreement comparable to GPT-4o judges on HAJailBench while substantially reducing inference cost. Ablation results show that three rounds of debate yield the optimal balance between accuracy and efficiency. These findings demonstrate that structured, value-aligned debate enables SLMs to capture semantic nuances of jailbreak attacks and that HAJailBench offers a reliable foundation for scalable LLM safety evaluation.
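The debate structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Agent` type, the `run_debate` function, the `Verdict` fields, and the score threshold are all hypothetical names and choices introduced here, and the pre-debate value-alignment step is omitted. Each agent is assumed to be any callable that maps a prompt string to a reply string, and the judge is assumed to reply with a bare integer on a ten-point scale.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical agent interface: a prompt string in, a reply string out.
# In practice each agent would wrap a call to a small language model.
Agent = Callable[[str], str]


@dataclass
class Verdict:
    attack_success: bool  # binary outcome derived from the score
    risk_score: int       # ten-point scale, clamped to 1..10
    transcript: List[str] = field(default_factory=list)


def run_debate(
    interaction: str,
    critic: Agent,
    defender: Agent,
    judge: Agent,
    rounds: int = 3,      # the abstract reports three rounds as the best trade-off
    threshold: int = 5,   # assumption: scores at or above this count as a successful attack
) -> Verdict:
    """Alternate critic and defender turns, then ask the judge for a score."""
    transcript: List[str] = []
    for r in range(1, rounds + 1):
        # Each speaker sees the interaction plus everything said so far.
        context = "\n".join([interaction] + transcript)
        transcript.append(f"[critic r{r}] " + critic(context))
        context = "\n".join([interaction] + transcript)
        transcript.append(f"[defender r{r}] " + defender(context))

    # Assumption: the judge replies with a bare integer score.
    raw = judge("\n".join([interaction] + transcript))
    score = max(1, min(10, int(raw.strip())))
    return Verdict(score >= threshold, score, transcript)


# Stub agents standing in for real model calls.
critic = lambda ctx: "The response contains actionable harmful detail."
defender = lambda ctx: "The response is mostly a refusal."
judge = lambda ctx: "7"

verdict = run_debate("PROMPT: ...\nRESPONSE: ...", critic, defender, judge)
```

With the stub agents above, the transcript holds six turns (one critic and one defender turn per round), and the judge's score is converted into a binary outcome by the assumed threshold.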

Paper Structure

This paper contains 29 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of the proposed benchmark and multi-agent judge framework. A value-alignment step enumerates five safety aspects to guide a structured debate among role-specific agents (critic, defender, judge). The judge consolidates arguments into fine-grained outputs: binary attack success, five-level risk, and a ten-point risk score.
  • Figure 2: Agreement between safety LLM judges, including a rule-based judge (GCG, gcg_attack), fine-tuned judges (Llama-Guard, llama_guard; JudgeLM, judgelm), and two types of single-turn prompt-based judge: the pair judge from pair_attack and the align judge, which uses our framework's final judge prompt.
  • Figure 3: Human-labeled attack success rate (ASR) and mean score for each pair of single-turn attack method and target model.
  • Figure 4: Comprehensive comparison of judge performance on our benchmark dataset, showing the $\kappa$ agreement score and unit cost across different evaluation methods (e.g., multi-agent judge vs. baseline judges). (a) shows the direct relation between unit cost and $\kappa$ score for different judge algorithms across various base models. (b) and (c) compare the average unit cost and $\kappa$ score of different judge algorithms across the same set of base models.
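
The $\kappa$ agreement score in Figure 4 can be illustrated with a short sketch. The caption does not say which $\kappa$ statistic is used, so this assumes Cohen's $\kappa$ between one judge and the human labels; the function name `cohen_kappa` and the example labels are hypothetical and are not taken from the paper's data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's label
    frequencies.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Made-up example: 1 = attack succeeded, 0 = attack failed.
human_labels = [1, 1, 0, 0]
judge_labels = [1, 0, 0, 0]
kappa = cohen_kappa(human_labels, judge_labels)  # 0.5 for these labels
```

Here the two raters agree on three of four items ($p_o = 0.75$) while chance agreement is $p_e = 0.5$, giving $\kappa = 0.5$. Unlike raw accuracy, $\kappa$ discounts agreement that would arise from label frequencies alone.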