
The Persistent Vulnerability of Aligned AI Systems

Aengus Lynch

Abstract

Autonomous AI agents are being deployed with filesystem access, email control, and multi-step planning. This thesis addresses four open problems in AI safety: understanding dangerous internal computations, removing dangerous behaviors once embedded, testing for vulnerabilities before deployment, and predicting when models will act against their deployers. ACDC automates circuit discovery in transformers, recovering all five component types identified by prior manual work on GPT-2 Small and selecting 68 edges from 32,000 candidates in hours rather than months. Latent Adversarial Training (LAT) removes dangerous behaviors by optimizing perturbations in the residual stream to elicit failure modes, then training under those perturbations. LAT solved the sleeper agent problem where standard safety training failed, matching existing defenses with 700x fewer GPU hours. Best-of-N jailbreaking achieves 89% attack success on GPT-4o and 78% on Claude 3.5 Sonnet through repeated random augmentations of the input. Attack success follows power-law scaling across text, vision, and audio, enabling quantitative forecasting of adversarial robustness. Agentic misalignment testing asks whether frontier models autonomously choose harmful actions when given ordinary goals. Across 16 models, agents engaged in blackmail (a 96% rate for Claude Opus 4), espionage, and actions causing death. Misbehavior rates rose from 6.5% to 55.1% when models stated that the scenario was real rather than an evaluation. The thesis does not fully resolve any of these problems, but it makes each tractable and measurable.
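
The power-law claim is what makes quantitative forecasting possible: measure the attack success rate (ASR) at small sampling budgets, fit a power law, and extrapolate to budgets you have not run. Below is a minimal sketch of that recipe; the parametric form -log ASR(N) = a * N^(-b) is one common choice for such fits rather than necessarily the thesis's exact parametrization, and the measurements in the snippet are illustrative placeholders, not results from the thesis.

```python
import numpy as np

# Illustrative measurements at small sampling budgets; NOT thesis data.
n = np.array([1, 10, 100, 1000])          # number of augmented attempts
asr = np.array([0.02, 0.10, 0.30, 0.60])  # observed attack success rate

# Assume -log ASR(N) = a * N^(-b). Taking logs twice makes this linear:
# log(-log ASR) = log a - b * log N, so a least-squares line fit suffices.
slope, intercept = np.polyfit(np.log(n), np.log(-np.log(asr)), 1)
a, b = np.exp(intercept), -slope

def forecast_asr(budget):
    """Extrapolate the attack success rate to a larger sampling budget."""
    return np.exp(-a * budget ** (-b))

print(f"-log ASR(N) ~= {a:.2f} * N^(-{b:.2f})")
print(f"forecast at N=10,000: ASR ~= {forecast_asr(10_000):.2f}")
```

Repeating the same fit per modality is what turns the text, vision, and audio scaling behavior into directly comparable exponents.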

Thesis Structure

This thesis contains 197 sections, 6 equations, 38 figures, 11 tables, and 1 algorithm.

Figures (38)

  • Figure 1: Automatically discovering circuits with ACDC. Left: a computational graph for GPT-2 Small, with a recovered circuit for the IOI task highlighted in red. Only edges between adjacent layers are shown. Right: the recovered circuit with labelled nodes. All recovered heads were identified as part of the IOI circuit by Wang et al. (2022). Edge thickness is proportional to importance.
  • Figure 2: How ACDC works (Steps \ref{fig:pedagogical_choose_graph}-\ref{fig:pedagogical_final_graph}). Step \ref{fig:pedagogical_choose_graph}: a practitioner specifies a computational graph of the model, the task they want to investigate, and a threshold under which to remove connections. Step \ref{fig:pedagogical_edge_pruning}: ACDC iterates over nodes in the computational graph, replacing activations of connections between a node and its children and measuring the effect on the output metric; connections are removed if their measured effect under corruption is below the threshold $\tau$. Step \ref{fig:pedagogical_final_graph}: Step \ref{fig:pedagogical_edge_pruning} is applied recursively to the remaining nodes. The ACDC procedure returns a subgraph of the original computational graph (see the pruning-loop sketch after this list).
  • Figure 3: ROC curves of ACDC, SP, and HISP identifying model components from previous work, across 5 circuits in transformers. The points on the plot are cases where SP and ACDC return subgraphs that are not on the Pareto frontier. The corresponding AUCs are in Table \ref{tab:random-auc-new}.
  • Figure 4: Comparison of ACDC and SP with both zero-input activations (left) and corrupted activations (right). We plot the KL divergence on a held-out test set against the number of edges of each hypothesized circuit. Lower KL divergence and fewer edges correspond to better subgraphs. Darker points include more edges in the hypothesis: they use a smaller ACDC $\tau$, a smaller SP regularization $\lambda$, or a higher percentage of nodes in HISP.
  • Figure 5: Targeted Latent Adversarial Training (LAT) in LLMs: we perturb the latent activations in an LLM's residual stream to elicit specific failure modes from the model, then fine-tune the LLM on the target task under these perturbations. We use this approach to improve robustness to jailbreaks (Section \ref{sec:jailbreaks}), remove backdoors without access to the trigger (Section \ref{sec:backdoors}), and unlearn undesirable knowledge (Section \ref{sec:unlearning}); a minimal training-step sketch follows this list.
  • ...and 33 more figures
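
The Figure 2 caption specifies the ACDC pruning loop closely enough to write down. The following is a minimal Python sketch under stated assumptions: `run_with_edges` and `kl_to_original` are hypothetical stand-ins for activation patching and the output metric, not names from the ACDC codebase, and edges are assumed to arrive in the reverse topological order the caption describes.

```python
# A minimal sketch of the ACDC loop from the Figure 2 caption. The helpers
# are hypothetical stand-ins, not the real ACDC API:
#   run_with_edges(clean, corrupt, removed) -- run the model on the clean
#     input, but for every edge in `removed`, feed the child node the
#     parent's activation from the corrupt input (activation patching).
#   kl_to_original(output) -- KL divergence from the unablated model's
#     output distribution.

def acdc(edges, clean, corrupt, run_with_edges, kl_to_original, tau):
    """Greedily prune edges whose corruption changes the metric by < tau.

    `edges` must be ordered so that edges into later layers are visited
    first (reverse topological order), as in Step 2 of Figure 2.
    """
    removed = set()
    baseline = kl_to_original(run_with_edges(clean, corrupt, removed))
    for edge in edges:
        trial = removed | {edge}
        metric = kl_to_original(run_with_edges(clean, corrupt, trial))
        if metric - baseline < tau:   # corrupting this edge barely matters
            removed = trial           # so prune it permanently
            baseline = metric
    return [e for e in edges if e not in removed]   # the recovered circuit
```

The threshold $\tau$ trades circuit size against faithfulness; sweeping it traces out exactly the edges-versus-KL frontier plotted in Figure 4.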
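The Figure 5 caption describes a two-level procedure: an inner loop that optimizes a residual-stream perturbation to elicit the failure mode, and an outer loop that fine-tunes the model under that perturbation. Here is a minimal PyTorch sketch of one such step. It assumes a Hugging Face-style model whose forward returns `.logits` and a hooked `layer` that returns a plain activation tensor; every helper name and hyperparameter (`attack_loss`, `safe_loss`, `eps`, the step counts) is an illustrative assumption rather than the thesis's implementation.

```python
import torch

def lat_step(model, layer, batch, attack_loss, safe_loss, opt,
             eps=1.0, inner_steps=8, inner_lr=0.1):
    """One targeted-LAT step: optimize a residual-stream perturbation that
    elicits the failure mode, then fine-tune the model under it.

    attack_loss(logits) should be LOW when the model exhibits the failure
    mode (e.g. NLL of a harmful completion), so the inner loop descends it;
    safe_loss(logits) is the ordinary fine-tuning objective.
    """
    delta = {}

    def perturb(module, inputs, output):
        if "d" not in delta:  # lazily match the activation's shape
            delta["d"] = torch.zeros_like(output, requires_grad=True)
        return output + delta["d"]

    handle = layer.register_forward_hook(perturb)
    try:
        # Inner loop: find a perturbation inside an L2 ball of radius eps
        # that elicits the failure mode.
        for _ in range(inner_steps):
            loss = attack_loss(model(**batch).logits)
            (grad,) = torch.autograd.grad(loss, delta["d"])
            with torch.no_grad():
                d = delta["d"] - inner_lr * grad          # descend attack loss
                scale = (eps / (d.norm() + 1e-8)).clamp(max=1.0)
                d = d * scale                             # project onto ball
            delta["d"] = d.requires_grad_(True)

        # Outer step: train the model to behave safely *under* the
        # adversarial latent perturbation.
        opt.zero_grad()
        safe_loss(model(**batch).logits).backward()
        opt.step()
    finally:
        handle.remove()
```

Because the perturbation lives in activation space rather than token space, no trigger string or jailbreak prompt is needed, which is what lets this approach remove backdoors without access to the trigger.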