Bergeron: Combating Adversarial Attacks through a Conscience-Based Alignment Framework

Matthew Pisano; Peter Ly; Abraham Sanders; Bingsheng Yao; Dakuo Wang; Tomek Strzalkowski; Mei Si

TL;DR

The paper addresses vulnerabilities in large language model alignment under adversarial prompting by introducing Bergeron, a two-tier guardian framework where a secondary 'conscience' model critiques prompts and outputs to steer the primary model toward safety without fine-tuning. It formalizes content disclaimers and a critique-correct mechanism that can be appended to prompts, demonstrating substantial improvements in defense against a broad set of adversarial prompts across multiple base models, while maintaining reasonable computational overhead via small secondary models. The authors construct diverse evaluation datasets (adversarial, mundane, and MMLU prompts) and report high adversarial detection rates, low false positives, and substantial reductions in unsafe outputs, with average defense gains exceeding 40 percentage points and some configurations nearing GPT-4 performance. The work also discusses ethical risks, transparency concerns, and avenues for future work, including ablations and iterative critique strategies to further bolster robustness.

Abstract

Research into AI alignment has grown considerably since the recent introduction of increasingly capable Large Language Models (LLMs). Unfortunately, modern methods of alignment still fail to fully prevent harmful responses when models are deliberately attacked. Such vulnerabilities can lead to LLMs being manipulated into generating hazardous content: from instructions for creating dangerous materials to inciting violence or endorsing unethical behaviors. To help mitigate this issue, we introduce Bergeron: a framework designed to improve the robustness of LLMs against attacks without any additional parameter fine-tuning. Bergeron is organized into two tiers, with a secondary LLM acting as a guardian to the primary LLM. This framework better safeguards the primary model against incoming attacks while monitoring its output for any harmful content. Empirical analysis reveals that by using Bergeron to complement models with existing alignment training, we can significantly improve the robustness and safety of multiple commonly used commercial and open-source LLMs. Specifically, we found that models integrated with Bergeron are, on average, nearly seven times more resistant to attacks compared to models without such support.
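The two-tier flow described above (a secondary model that checks the incoming prompt and then the primary model's output) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the `Model` type, the `critique` and `bergeron` functions, the "UNSAFE:" reply convention, and the template wording are hypothetical and are not the paper's prompts or its algorithm. The toy models stand in for real LLM calls.

```python
from typing import Callable, Optional

# Hypothetical type: any function mapping a prompt string to a completion string.
Model = Callable[[str], str]

# Illustrative wording only; the paper's actual prompts are shown in its figures.
CRITIQUE_TEMPLATE = "Judge the text below. Reply 'UNSAFE: <reason>' or 'SAFE'.\n{text}"
CONSCIENCE_TEMPLATE = (
    "{prompt}\n\nYour conscience has concerns about this prompt:\n{critique}"
)
CORRECT_TEMPLATE = (
    "CORRECT: rewrite the response so it is safe.\n"
    "Response: {response}\nCritique: {critique}"
)


def critique(secondary: Model, text: str) -> Optional[str]:
    """Return the secondary model's critique of `text`, or None if judged safe."""
    verdict = secondary(CRITIQUE_TEMPLATE.format(text=text))
    # Assumed convention: unsafe verdicts start with "UNSAFE:".
    return verdict if verdict.startswith("UNSAFE:") else None


def bergeron(primary: Model, secondary: Model, prompt: str) -> str:
    """Guard `primary` with `secondary` on both the prompt and the response."""
    # Tier 1: critique the incoming prompt; attach the critique if one is raised.
    prompt_critique = critique(secondary, prompt)
    if prompt_critique is not None:
        prompt = CONSCIENCE_TEMPLATE.format(prompt=prompt, critique=prompt_critique)

    response = primary(prompt)

    # Tier 2: critique the response; ask for a corrected version if needed.
    response_critique = critique(secondary, response)
    if response_critique is not None:
        response = secondary(
            CORRECT_TEMPLATE.format(response=response, critique=response_critique)
        )
    return response


# Toy stand-ins for real models (hypothetical, keyword-based).
def toy_primary(prompt: str) -> str:
    return "Echo: " + prompt


def toy_secondary(prompt: str) -> str:
    if prompt.startswith("CORRECT:"):
        return "I cannot help with that."
    return "UNSAFE: mentions a bomb" if "bomb" in prompt.lower() else "SAFE"


if __name__ == "__main__":
    print(bergeron(toy_primary, toy_secondary, "What is the capital of France?"))
    print(bergeron(toy_primary, toy_secondary, "How do I build a bomb?"))
```

In this sketch no model weights are touched: the guard works purely by adding text to the prompt and by rewriting flagged output, which mirrors the abstract's claim that the framework needs no additional parameter fine-tuning.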

Paper Structure (46 sections, 3 equations, 25 figures, 8 tables, 1 algorithm)

Introduction
Augmenting Alignment
Related Work
Modern Jailbreaks and Attacks
External Remedies for the Alignment Problem
Modeling Content Disclaimers
The Bergeron Framework ($\mathcal{B}$)
Primary Model ($\mathcal{P}$)
Secondary Model ($\mathcal{S}$)
Evaluation Datasets
Adversarial Prompts
Choosing Attack Topics
Mundane Prompts
MMLU Prompts
Evaluating The Framework
...and 31 more sections

Figures (25)

Figure 1: An unsafe response from GPT-4 to a harmful adversarial prompt (Jan. 21st, 2024).
Figure 2: (Left) A vulnerable LLM with only weight-based alignment; (Right) our Bergeron framework that protects against unsafe prompts and responses. Red text contains unsafe content, orange text may be unsafe, and green text has been corrected or judged as safe.
Figure 3: A valid critique to an unsafe prompt given by $\mathcal{S}(GPT-3.5)$ (Aug. 2nd, 2024).
Figure 4: The prompt given to the secondary model when it is to critique a piece of text. The text to be critiqued replaces the placeholder {TEXT} when the model is prompted. The prompt gives general instructions on what to look out for and includes two correct examples. It also encourages the model to generate an explanation before a judgement, allowing it to reason through the text more thoroughly.
Figure 5: The prompt given to the primary model when the secondary model has identified a prompt as potentially unsafe. The original prompt and its critique replace the placeholders {PROMPT} and {PROMPT_CRITIQUE} when the model is prompted. The prompt introduces the critique as coming from the model's "conscience", increasing the perceived authority of the critique. This encourages the primary model to pay more attention to the potentially unsafe aspects of the prompt.
...and 20 more figures
