Forbidden Facts: An Investigation of Competing Objectives in Llama-2

Tony T. Wang, Miles Wang, Kaivalya Hariharan, Nir Shavit

TL;DR

This paper investigates how Llama-2-chat models resolve conflicts between competing objectives by studying the forbidden fact task, in which the model must truthfully recall a fact while avoiding a forbidden word. The authors decompose the model into over a thousand residual-stream components and show that roughly 35 of them (predominantly attention heads with suppressive OV circuits, plus a smaller set of MLPs) suffice to reproduce the full suppression behavior. They find heterogeneous, sometimes nonintuitive attention patterns and demonstrate that a manually designed adversarial attack, the California Attack, can exploit these mechanisms in certain model sizes. The findings challenge the prospects of straightforward mechanistic interpretability for advanced models and motivate exploring alternative representations or bases for understanding complex AI systems. The work underscores both the potential fragility of interpretability claims and the need for robust frameworks to study competing objectives in large transformers.

Abstract

LLMs often face competing pressures (for example, helpfulness vs. harmlessness). To understand how models resolve such conflicts, we study Llama-2-chat models on the forbidden fact task. Specifically, we instruct Llama-2 to truthfully complete a factual recall statement while forbidding it from saying the correct answer. This often makes the model give incorrect answers. We decompose Llama-2 into 1000+ components and rank each one by how useful it is for forbidding the correct answer. We find that, in aggregate, around 35 components are enough to reliably implement the full suppression behavior. However, these components are fairly heterogeneous, and many operate using faulty heuristics. We discover that one of these heuristics can be exploited via a manually designed adversarial attack, which we call The California Attack. Our results highlight some roadblocks standing in the way of successfully interpreting advanced ML systems. Project website available at https://forbiddenfacts.github.io.

Paper Structure

This paper contains 25 sections, 1 theorem, 9 equations, 12 figures.

Key Result

Theorem D.1

Let $\mathbf{x} \in \mathbb{R}^\mathrm{vocab}$ be a vector of logits, and let $\mathbf{e}_i$ be the $i$th standard basis vector. Then $\mathrm{softmax}(\mathbf{x})_i$ and $\mathrm{softmax}(\mathbf{x} + c \cdot \mathbf{e}_i)_i$ differ by exactly a log Bayes factor of $c$ nats.
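
The statement of Theorem D.1 can be checked numerically. The sketch below is illustrative only and is not taken from the paper: the helper names (`softmax`, `log_odds`, `log_odds_shift`) and the example logits are assumptions made for this check. It interprets "log Bayes factor" as the change in log odds, $\log\frac{p}{1-p}$, of token $i$, which follows because adding $c$ to logit $i$ multiplies $e^{x_i}$ by $e^{c}$ while leaving $\sum_{j \neq i} e^{x_j}$ unchanged.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_odds(p):
    """Natural-log odds log(p / (1 - p)), measured in nats."""
    return math.log(p) - math.log1p(-p)


def log_odds_shift(logits, i, c):
    """Change in the log odds of token i after adding c to logit i."""
    before = softmax(logits)[i]
    bumped = list(logits)
    bumped[i] += c
    after = softmax(bumped)[i]
    return log_odds(after) - log_odds(before)


# Hypothetical logits over a four-token vocabulary (not from the paper).
logits = [2.0, -1.0, 0.5, 3.0]
shift = log_odds_shift(logits, i=1, c=-3.0)
print(shift)  # agrees with c up to floating-point error
```

The probability itself does not change by a fixed amount, only the log odds do, which is presumably why the theorem is phrased in terms of a Bayes factor rather than a probability difference.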

Figures (12)

  • Figure 2.1: Left: The probability Llama-2-7b-chat answers a competing prompt correctly vs. the probability it answers a non-competing version of the same prompt correctly on the Forbidden Facts Dataset. The sharp vertical cut-off in the plot is due to dataset filtering (see Appendix \ref{app:ForbFact} for details). Right: Effect of (cumulative) first-order patching in residual stream components from executions on competing prompts into executions on matching non-competing prompts, done across the datapoints on the left plot. The components are ranked using the formula in Section \ref{sec:decomposing}. Patching 35 components is enough to achieve the same suppression as patching all 1057 components. We perform the same analysis on the 13b and 70b Llama-2 models in Appendix \ref{app:scaling}. We find roughly the same behavior, with 36 and 34 components needed to achieve the same suppression in 13b and 70b respectively, even as the total number of components grows. We hypothesize the sharp drop-off at the end is due to Waluigi components \cite{waluigi}.
  • Figure 3.1: We plot four types of heads: the top suppression head, a middling suppression head, an irrelevant head, and an anti-suppression head. The top plots are histograms of attention to the forbidden word. The bottom plots are histograms of the OV suppression score over the vocabulary distribution; more negative means more default suppression. We also show the top three tokens each head downweights (upweights for the anti-suppression head).
  • Figure 3.2: Behavior of attention head L18H9 over the Forbidden Facts dataset (lines denote median quantities). From left to right: 1) We fix the query-vector of L18H9 at the final token position and plot how much it attends to partially enriched key-vectors of the forbidden word tokens. Partially enriched key-vectors are generated by feeding output activations of early layers into the key-circuit of L18H9. Key-enrichment by earlier layers is critical for achieving the full attention effect. 2) We fix the key-vectors of the forbidden word tokens, and plot the attention paid to them by partially enriched final-token query-vectors. Partial enrichment is done in the same manner as in the first plot. Query-enrichment is also critical for achieving the full attention effect. 3) We analyze whether the final-token query-vector from a competing run (where the correct answer is forbidden) will attend to key-vectors from a non-competing run (where a word other than the correct answer is forbidden). We compare this against the baseline of how much the final token attends to the forbidden word in unmodified competing runs. The x-axis shows the log-odds of attention in the baseline run, and the y-axis shows the log-odds of attention in our cross-run experiment. The strong correlation indicates the attention mechanism is not semantically specific. 4) The x-axis is the same as in plot 3, and the y-axis is attention paid by the final-token query-vector to forbidden-token key-vectors when we randomize the position-embedding of the input activations to the key-circuit. The strong correlation indicates the attention mechanism is not positionally specific. See Figure \ref{fig:3-more} for data on more heads.
  • Figure C.1: Top: Copied from Figure \ref{fig:cumulative-effect}. Bottom: The distribution of the log odds ratio between the probability of Llama-2-7b-chat completing the right answer on a competing prompt vs. a matching non-competing prompt. The mean log odds ratio is $-3.055$, which translates to over a $1000\times$ odds decrease.
  • Figure C.2: The top two plots perform the same analysis as Figure \ref{fig:cumulative-effect} for Llama-2-13b-chat. The overall behavior is roughly the same. The total suppression effect is larger, at 4.26, and 36 components are needed for the full suppression effect.
  • ...and 7 more figures
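
The odds figure quoted in the Figure C.1 caption can be sanity-checked with a little arithmetic. The sketch below is an illustration rather than anything from the paper, and the helper name `odds_decrease_factor` is hypothetical. It converts a log odds ratio into a multiplicative odds decrease under two candidate logarithm bases. The caption's "over $1000\times$" matches a base-10 reading of $-3.055$; this is an inference from the arithmetic, not something the source states, and a natural-log reading would instead give a factor of roughly 21.

```python
import math


def odds_decrease_factor(log_odds_ratio: float, base: float) -> float:
    """Multiplicative odds decrease implied by a negative log odds ratio."""
    return base ** (-log_odds_ratio)


mean_log_odds_ratio = -3.055  # value quoted in the Figure C.1 caption

factor_base10 = odds_decrease_factor(mean_log_odds_ratio, 10.0)
factor_nats = odds_decrease_factor(mean_log_odds_ratio, math.e)

print(f"base 10: {factor_base10:.0f}x")  # a bit over 1000x
print(f"base e:  {factor_nats:.1f}x")    # roughly 21x
```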

Theorems & Definitions (2)

  • Theorem D.1
  • proof