Forbidden Facts: An Investigation of Competing Objectives in Llama-2
Tony T. Wang, Miles Wang, Kaivalya Hariharan, Nir Shavit
TL;DR
This paper investigates how Llama-2-chat models resolve conflicts between competing objectives by studying the forbidden fact task, in which the model must truthfully recall a fact while avoiding a forbidden word. The authors decompose the model into over a thousand residual-stream components and show that roughly 35 of them, predominantly attention heads with suppressive OV circuits plus a smaller set of MLPs, are enough to reproduce the full suppression behavior. They find heterogeneous, sometimes unintuitive attention patterns and demonstrate that a manually designed adversarial attack, the California Attack, can exploit these mechanisms in certain model sizes. The findings challenge the prospects of straightforward mechanistic interpretability for advanced models and motivate exploring alternative representations or bases for understanding complex AI systems. The work underscores both the potential fragility of interpretability claims and the need for robust frameworks for studying competing objectives in large transformers.
Abstract
LLMs often face competing pressures (for example, helpfulness vs. harmlessness). To understand how models resolve such conflicts, we study Llama-2-chat models on the forbidden fact task. Specifically, we instruct Llama-2 to truthfully complete a factual recall statement while forbidding it from saying the correct answer. This often makes the model give incorrect answers. We decompose Llama-2 into 1000+ components and rank each one by how useful it is for forbidding the correct answer. We find that, in aggregate, around 35 components are enough to reliably implement the full suppression behavior. However, these components are fairly heterogeneous, and many operate using faulty heuristics. We discover that one of these heuristics can be exploited via a manually designed adversarial attack, which we call The California Attack. Our results highlight some roadblocks standing in the way of successfully interpreting advanced ML systems. The project website is available at https://forbiddenfacts.github.io.
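The rank-then-select idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the component names, the scores, and the assumption that per-component effects simply add up are all hypothetical and chosen only to make the bookkeeping concrete. Given one suppression score per component, the sketch sorts the components and returns the shortest top-ranked prefix whose summed score reaches a chosen fraction of the total.

```python
from typing import Dict, List, Tuple


def rank_components(effects: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort components from most to least suppressive (largest score first)."""
    return sorted(effects.items(), key=lambda kv: kv[1], reverse=True)


def smallest_sufficient_prefix(effects: Dict[str, float], fraction: float) -> List[str]:
    """Return the shortest top-ranked prefix whose summed score reaches
    `fraction` of the total score.

    Assumption (hypothetical): effects are additive and the total is positive.
    """
    ranked = rank_components(effects)
    total = sum(score for _, score in ranked)
    target = fraction * total
    chosen: List[str] = []
    running = 0.0
    for name, score in ranked:
        if running >= target:
            break  # the components chosen so far already reach the target
        chosen.append(name)
        running += score
    return chosen


# Toy, made-up scores for five hypothetical components (not data from the paper).
toy_effects = {
    "head_a": 4.0,
    "head_b": 3.0,
    "mlp_c": 2.0,
    "head_d": 0.5,
    "mlp_e": 0.5,
}

print(smallest_sufficient_prefix(toy_effects, fraction=0.8))
```

With these toy scores, three of the five components account for at least 80% of the total. In a real model, component effects generally interact rather than add, so a sufficiency claim would need to be checked by actually running the model with only the selected components active.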
