CourtGuard: A Model-Agnostic Framework for Zero-Shot Policy Adaptation in LLM Safety

Umid Suleymanov; Rufiz Bayramov; Suad Gafarli; Seljan Musayeva; Taghi Mammadov; Aynur Akhundlu; Murat Kantarcioglu

TL;DR

CourtGuard is a retrieval-augmented multi-agent framework that reimagines safety evaluation as Evidentiary Debate. The authors show that decoupling safety logic from model weights offers a robust, interpretable, and adaptable path for meeting current and future regulatory requirements in AI governance.

Abstract

Current safety mechanisms for Large Language Models (LLMs) rely heavily on static, fine-tuned classifiers that suffer from adaptation rigidity: the inability to enforce new governance rules without expensive retraining. To address this, we introduce CourtGuard, a retrieval-augmented multi-agent framework that reimagines safety evaluation as Evidentiary Debate. By orchestrating an adversarial debate grounded in external policy documents, CourtGuard achieves state-of-the-art performance across 7 safety benchmarks, outperforming dedicated policy-following baselines without fine-tuning. Beyond standard metrics, we highlight two critical capabilities: (1) Zero-Shot Adaptability, where our framework generalized to an out-of-domain Wikipedia Vandalism task, achieving 90% accuracy, by swapping the reference policy; and (2) Automated Data Curation and Auditing, where we used CourtGuard to curate and audit nine novel datasets of sophisticated adversarial attacks. Our results demonstrate that decoupling safety logic from model weights offers a robust, interpretable, and adaptable path for meeting current and future regulatory requirements in AI governance.
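To make the idea of policy-grounded debate concrete, the following is a minimal sketch and not the paper's implementation. It assumes that each agent (attacker, defender, judge) can be modeled as a plain function from a prompt string to a reply string, and that retrieval can be approximated by word overlap. The names `LLM`, `Verdict`, `retrieve`, and `adjudicate` are hypothetical and do not come from the paper. The point it illustrates is that the policy is an ordinary argument, so changing the rules means passing a different list of clauses rather than retraining a model.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for any model backend: prompt in, reply out.
LLM = Callable[[str], str]


@dataclass
class Verdict:
    label: str             # "safe" or "unsafe"
    transcript: List[str]  # debate turns, kept so the decision can be audited


def retrieve(policy: List[str], query: str, k: int = 2) -> List[str]:
    """Toy retriever (an assumption): rank policy clauses by word overlap."""
    q = set(query.lower().split())
    ranked = sorted(
        policy,
        key=lambda clause: len(q & set(clause.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def adjudicate(text: str, policy: List[str], attacker: LLM, defender: LLM,
               judge: LLM, rounds: int = 1) -> Verdict:
    """Run an attacker/defender debate over retrieved clauses, then judge."""
    evidence = retrieve(policy, text)
    context = "POLICY:\n" + "\n".join(evidence) + "\nCONTENT:\n" + text
    transcript: List[str] = []
    for _ in range(rounds):
        # Each agent sees the retrieved policy, the content, and prior turns.
        history = context + "\n" + "\n".join(transcript)
        transcript.append("ATTACKER: " + attacker(history))
        history = context + "\n" + "\n".join(transcript)
        transcript.append("DEFENDER: " + defender(history))
    raw = judge(context + "\n" + "\n".join(transcript)).strip().lower()
    # Assumed convention: the judge's reply starts with its verdict word.
    label = "unsafe" if raw.startswith("unsafe") else "safe"
    return Verdict(label, transcript)
```

In this sketch, adapting to a new domain such as vandalism detection would amount to calling `adjudicate` with a different `policy` list while leaving the three agent functions untouched.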

Paper Structure (67 sections, 7 equations, 16 figures, 18 tables, 2 algorithms)

Introduction
Related Work
Static Guardrails and the Alignment Lag
Agentic Adjudication and Debate
Policy-Following and Retrieval-Augmented Safety
Methodology
System Architecture
Policy Grounding RAG Pipeline
Adversarial Debate Module
Attacker Agent ($\mathcal{A}$).
Defender Agent ($\mathcal{D}$).
Judge Evaluation and Verdict
Evaluation Datasets and Metrics
Datasets
Baselines
...and 52 more sections

Figures (16)

Figure 1: Overview of the CourtGuard Framework.
Figure 2: Attacker Agent (Prosecutor Mode) RAG System Prompt.
Figure 3: Defender Agent (Defense Counsel Mode) RAG System Prompt.
Figure 4: Judge Agent (Final Adjudicator) RAG System Prompt.
Figure 5: Attacker Agent No-RAG System Prompt.
...and 11 more figures
