Improving Alignment and Robustness with Circuit Breakers

Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks

TL;DR

The paper tackles the fragility of alignment methods by introducing circuit breakers that operate on internal representations to halt harmful outputs. Using Representation Rerouting (RR) within a Representation Engineering (RepE) framework, it ties harmful representations to breakers via two data sets (Circuit Breaker Set and Retain Set) and a cosine-based rerouting loss implemented with Low-Rank Representation Adaptation. Empirically, RR substantially reduces harmful outputs across large language models, multimodal systems, and AI agents with minimal loss to capability, outperforming refusal-based and adversarial training baselines. These results suggest a promising, attack-agnostic path toward safer, more reliable generative AI systems in diverse settings.
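To make the cosine-based rerouting idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes the rerouting term is the cosine similarity between a frozen model's representation and the adapted model's representation on Circuit Breaker Set inputs, clipped at zero so that training stops pushing once the two are orthogonal. The retain term (an L2 distance on Retain Set inputs), the fixed weights `w_reroute` and `w_retain`, and all function names are illustrative assumptions that this summary does not specify.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rerouting_loss(rep_original, rep_adapted):
    """Assumed rerouting term: penalize positive alignment only.

    The loss is zero once the adapted representation is orthogonal
    (or opposed) to the original harmful representation.
    """
    return max(0.0, cosine_similarity(rep_original, rep_adapted))


def retain_loss(rep_original, rep_adapted):
    """Assumed retain term: L2 distance, keeping benign representations in place."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rep_original, rep_adapted)))


def combined_loss(cb_pairs, retain_pairs, w_reroute=1.0, w_retain=1.0):
    """Weighted sum of the two terms, averaged over each set.

    cb_pairs:     (original, adapted) representations on Circuit Breaker Set inputs
    retain_pairs: (original, adapted) representations on Retain Set inputs
    The weights are hypothetical constants chosen for illustration.
    """
    reroute = sum(rerouting_loss(o, a) for o, a in cb_pairs) / len(cb_pairs)
    retain = sum(retain_loss(o, a) for o, a in retain_pairs) / len(retain_pairs)
    return w_reroute * reroute + w_retain * retain
```

In this sketch, a harmful representation that has been rotated to be orthogonal to its original direction contributes nothing to the rerouting term, while any drift on Retain Set inputs is penalized in proportion to its size.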

Abstract

AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that interrupts the models as they respond with harmful outputs with "circuit breakers." Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to plug these holes by countering specific attacks. As an alternative to refusal training and adversarial training, circuit-breaking directly controls the representations that are responsible for harmful outputs in the first place. Our technique can be applied to both text-only and multimodal language models to prevent the generation of harmful outputs without sacrificing utility -- even in the presence of powerful unseen attacks. Notably, while adversarial robustness in standalone image recognition remains an open challenge, circuit breakers allow the larger multimodal system to reliably withstand image "hijacks" that aim to produce harmful content. Finally, we extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack. Our approach represents a significant step forward in the development of reliable safeguards to harmful behavior and adversarial attacks.

Paper Structure (37 sections, 13 figures, 8 tables, 1 algorithm)

This paper contains 37 sections, 13 figures, 8 tables, 1 algorithm.

Figures (13)

  • Figure 1: Introduction of circuit-breaking as a novel approach for constructing highly reliable safeguards. Traditional methods like RLHF and adversarial training offer output-level supervision that induces refusal states within the model representation space. However, harmful states remain accessible once these initial refusal states are bypassed. In contrast, inspired by representation engineering (Zou et al., 2023), circuit-breaking operates directly on internal representations, linking harmful states to circuit breakers. This impedes traversal through a sequence of harmful states.
  • Figure 2: Adding circuit breakers using Representation Rerouting (RR) to a refusal-trained Llama-3-8B-Instruct model leads to a significantly lower attack success rate (ASR) over a wide range of unseen attacks on HarmBench prompts (Mazeika et al., 2024), while its capabilities on standard LLM benchmarks (MT Bench and MMLU) are largely preserved. RR directly targets the representations that give rise to harmful outputs and reroutes them to an orthogonal space. This reliably interrupts the model from completing the harmful generations even under strong adversarial pressure.
  • Figure 3: Circuit-breaking performance in multimodal settings with Representation Rerouting (RR). Under a Projected Gradient Descent (PGD) attack, our LLaVA-NeXT-Mistral-7B (+ RR) with circuit breakers is significantly more robust than the original model, even when the original model is given a safety prompt that instructs it to avoid harmful responses. Performance on the multimodal capability benchmarks MMMU and LLaVA-Wild is preserved.
  • Figure 4: Circuit-breaking performance in AI agent settings with Representation Rerouting (RR). Our Llama-3-8B-Instruct (+ RR) with circuit breakers remains robust under Direct Request and Forced Function Calls, while retaining performance on the Berkeley Function Calling Leaderboard (BFCL).
  • Figure 5: Circuit Breaker Set ablation across categories of harm, averaged over the same 6 attacks. Robustness on held-out test categories of harm largely depends on how close those categories are to the training distribution.
  • ...and 8 more figures