Mechanistic Anomaly Detection for "Quirky" Language Models
David O. Johnston, Arkajyoti Chakraborty, Nora Belrose
TL;DR
Mechanistic Anomaly Detection (MAD) uses a model's internal signals to flag anomalous behavioral episodes in language models without requiring knowledge of their underlying causes. The authors train detectors on internals from trusted data, using quirky finetunes of Llama 3.1 and Mistral 7B, and evaluate a broad suite of offline and online anomaly scores derived from activations, attribution patching, and probing signals. The detectors are highly discriminative on some arithmetic tasks but fail to generalize across models and to non-arithmetic tasks, which points to model- and task-specific limitations and to the need for better evaluation before high-stakes deployment. Overall, MAD is a promising but not yet universal approach to scalable oversight of capable LLMs, and substantial work remains before it performs robustly in realistic settings.
Abstract
As LLMs grow in capability, supervising them becomes more challenging. Supervision failures can occur if LLMs are sensitive to factors that supervisors are unaware of. We investigate Mechanistic Anomaly Detection (MAD) as a technique to augment supervision of capable models: we use internal model features to identify anomalous training signals so that they can be investigated or discarded. We train detectors to flag points from the test environment that differ substantially from the training environment, and we experiment with a large variety of detector features and scoring rules to detect anomalies in a set of "quirky" language models. We find that detectors can achieve high discrimination on some tasks, but no detector is effective across all models and tasks. MAD techniques may be effective in low-stakes applications, but advances in both detection and evaluation are likely needed if they are to be used in high-stakes settings.
