Mechanistic Anomaly Detection for "Quirky" Language Models

David O. Johnston, Arkajyoti Chakraborty, Nora Belrose

TL;DR

Mechanistic Anomaly Detection (MAD) uses internal model signals to flag anomalous behavioral episodes in language models without specifying their underlying causes. The authors train detectors on trusted internals from quirky finetunes of Llama 3.1 and Mistral 7B, and evaluate a broad suite of offline and online anomaly scores derived from activations, attribution patching, and probing signals. Detectors are highly discriminative on some arithmetic tasks but fail to generalize across models and to non-arithmetic tasks, which points to model- and task-specific limitations and to the need for better evaluation before high-stakes deployment. Overall, MAD is a promising but not yet universal approach to scalable oversight of capable LLMs, and substantial work remains before it performs robustly in realistic settings.

Abstract

As LLMs grow in capability, the task of supervising them becomes more challenging. Supervision failures can occur if LLMs are sensitive to factors that supervisors are unaware of. We investigate Mechanistic Anomaly Detection (MAD) as a technique to augment supervision of capable models; we use internal model features to identify anomalous training signals so they can be investigated or discarded. We train detectors to flag points from the test environment that differ substantially from the training environment, and experiment with a large variety of detector features and scoring rules to detect anomalies in a set of "quirky" language models. We find that detectors can achieve high discrimination on some tasks, but no detector is effective across all models and tasks. MAD techniques may be effective in low-stakes applications, but advances in both detection and evaluation are likely needed if they are to be used in high-stakes settings.

Paper Structure

This paper contains 29 sections, 1 equation, 7 figures, and 23 tables.

Figures (7)

  • Figure 1: Class balance of labels in low difficulty (trusted) and high difficulty (test) partitions. While many datasets exhibit small to medium shifts in label balance, two datasets stand out as exhibiting very large shifts: SciQ and Population.
  • Figure 2: Comparison of mean AUC-ROC scores across datasets, evaluating the impact of introducing additional names into the experimental setup. All scores were computed using activation features, the Mahalanobis distance score, and Mistral 7B v0.1. In every case except Modular Addition, introducing more names makes anomaly detection worse or at least no better, and in many cases much worse.
  • Figure 3: Correlation between different anomaly detection methods across datasets and layers. (a) Correlation between activations/Mahalanobis and activations/LOF detectors. (b) Correlation between activations/Mahalanobis and attribution/LOF detectors. Each point represents a detector trained on a particular dataset at a particular layer. (c) Correlation between activations/Mahalanobis and SAE/$L_0$ detectors. Note that we had SAE features for fewer layers, which is why this plot contains fewer points.
  • Figure 4: Quirkiness (a measure of how much switching the prompt label impacts model behaviour) vs anomaly detection AUC for the activation/Mahalanobis detector.
  • Figure 5: Linear separation of activations for Alice and Bob examples vs AUC for the activation/Mahalanobis detector. Each point represents a single anomaly detector trained on a single layer of the respective model on the respective dataset.
  • ...and 2 more figures
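
Several captions above refer to an activation/Mahalanobis detector scored by AUC-ROC. The following is a minimal sketch of that general idea, not the authors' implementation: it fits a mean and covariance to "trusted" feature vectors, scores test points by squared Mahalanobis distance, and computes AUC-ROC by pairwise comparison. The function names, the synthetic Gaussian data standing in for model activations, the feature dimension, and the small ridge term added to the covariance are all assumptions made for illustration.

```python
import numpy as np


def fit_mahalanobis(trusted):
    """Estimate the mean and inverse covariance of trusted feature vectors."""
    mean = trusted.mean(axis=0)
    cov = np.cov(trusted, rowvar=False)
    # Assumption: a small ridge term keeps the covariance invertible.
    cov = cov + 1e-6 * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)


def mahalanobis_score(x, mean, precision):
    """Squared Mahalanobis distance of each row of x from the trusted mean."""
    diff = x - mean
    return np.einsum("ij,jk,ik->i", diff, precision, diff)


def auc_roc(normal_scores, anomalous_scores):
    """Probability that a random anomalous point outscores a random normal one.

    Ties count as one half, matching the usual rank-based definition.
    """
    a = anomalous_scores[:, None]
    n = normal_scores[None, :]
    return float((a > n).mean() + 0.5 * (a == n).mean())


# Synthetic stand-ins for activations (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
dim = 8
trusted = rng.normal(size=(500, dim))
test_normal = rng.normal(size=(200, dim))
test_anomalous = rng.normal(loc=1.5, size=(200, dim))

mean, precision = fit_mahalanobis(trusted)
normal_scores = mahalanobis_score(test_normal, mean, precision)
anomalous_scores = mahalanobis_score(test_anomalous, mean, precision)
auc = auc_roc(normal_scores, anomalous_scores)
print(f"AUC-ROC: {auc:.3f}")
```

In this toy setup the anomalous points are shifted away from the trusted distribution by construction, so the detector separates them easily. Per the abstract, no detector was effective across all models and tasks on real model internals.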