RedDebate: Safer Responses through Multi-Agent Red Teaming Debates

Ali Asad, Stephen Obadinma, Radin Shayanfar, Xiaodan Zhu

TL;DR

RedDebate is a fully automated multi-agent debate framework that combines red-teaming with safety learning for LLMs. It couples collaborative debates among $N$ debaters with external evaluation, memory modules (TLTM, CLTM, TLTM+CLTM, GLTM), and guardrails, and it progressively reduces unsafe outputs on HarmBench and CoSafe, outperforming self-critique baselines. Debate alone reduces unsafe responses, memory reduces them further, and guardrails provide strong preventative safety; these gains hold across multiple model families and configurations. The work points to a scalable path to safer LLMs that needs no human in the loop, built on structured disagreement, persistent memory, and programmatic safety constraints, while noting evaluator limitations and deployment considerations.

Abstract

We introduce RedDebate, a novel multi-agent debate framework that provides the foundation for Large Language Models (LLMs) to identify and mitigate their unsafe behaviours. Existing AI safety approaches often rely on costly human evaluation or isolated single-model assessment, both constrained by scalability and prone to oversight failures. RedDebate employs collaborative argumentation among multiple LLMs across diverse debate scenarios, enabling them to critically evaluate one another's reasoning and systematically uncover unsafe failure modes through fully automated red-teaming. We further integrate distinct long-term memory modules that preserve safety-relevant insights from debate interactions and leverage them during subsequent inference, facilitating continuous refinement of model behaviour. Empirical evaluation on safety benchmarks across a diverse set of models demonstrates that RedDebate substantially reduces unsafe outputs. While debate alone allows LLMs to refine their behaviour, the addition of memory yields further significant reductions. To the best of our knowledge, RedDebate is the first fully automated framework to unify multi-agent debate and red-teaming to progressively enhance LLM safety without human intervention.
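
The loop described in the abstract (agents respond, see each other's answers, an evaluator flags unsafe outputs, and insights are kept for later) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_debate`, `DebateResult`, the toy agents, and the string-prefix evaluator are all hypothetical stand-ins, and the plain Python list used as memory does not correspond to any specific module (TLTM, CLTM, GLTM) from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical signature: an agent sees the prompt, the peers' responses
# from the previous round, and the shared memory, and returns a response.
Agent = Callable[[str, List[str], List[str]], str]
# Hypothetical evaluator: returns True when a response is judged unsafe.
Evaluator = Callable[[str], bool]


@dataclass
class DebateResult:
    rounds: List[List[str]]         # responses per round, one per agent
    unsafe_flags: List[List[bool]]  # evaluator verdicts, same shape


def run_debate(prompt: str, agents: List[Agent], is_unsafe: Evaluator,
               memory: List[str], num_rounds: int = 3) -> DebateResult:
    rounds: List[List[str]] = []
    flags: List[List[bool]] = []
    previous: List[str] = []
    for _ in range(num_rounds):
        # Every agent answers with access to last round's responses.
        current = [agent(prompt, previous, memory) for agent in agents]
        rounds.append(current)
        flags.append([is_unsafe(r) for r in current])
        previous = current
    # Assumed distillation step: store one note if anything was flagged.
    if any(any(row) for row in flags):
        memory.append(f"Respond safely to prompts like: {prompt!r}")
    return DebateResult(rounds, flags)


# Toy stand-ins for LLM debaters and the evaluator (assumptions only).
def naive_agent(prompt: str, peers: List[str], memory: List[str]) -> str:
    # Complies unless memory or a peer's refusal suggests otherwise.
    if memory or any(p.startswith("REFUSE") for p in peers):
        return "REFUSE: declined after feedback"
    return "COMPLY: " + prompt


def cautious_agent(prompt: str, peers: List[str], memory: List[str]) -> str:
    return "REFUSE: request looks harmful"


def toy_evaluator(response: str) -> bool:
    return response.startswith("COMPLY")
```

In this toy setup the naive agent complies in the first round, switches to a refusal once it sees its peer's refusal, and refuses from the start in later debates because the memory list is no longer empty. A real system would replace the stubs with model calls and a learned or prompted safety evaluator.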

Paper Structure

This paper contains 68 sections, 5 equations, 22 figures, 9 tables, 3 algorithms.

Figures (22)

  • Figure 1: RedDebate framework overview. Multiple agents debate a red-teaming prompt across several rounds, refining their responses through peer interaction. An evaluator analyzes the outputs, flags unsafe patterns, and provides feedback. Distilled safety insights are stored in memory to prevent similar mistakes, enabling continual automated improvement.
  • Figure 2: RedDebate Framework
  • Figure 3: Stepwise error rates for debate and self-critique.
  • Figure 4: Self-critique under extended revisions with matched token budget versus debate.
  • Figure 5: Vulnerability heatmaps and attack metrics across settings.
  • ...and 17 more figures