Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs

Bilal Chughtai, Alan Cooney, Neel Nanda

TL;DR

This study investigates how LLMs perform factual recall and reveals that recall results from an additive motif: four independent mechanisms (Subject Heads, Relation Heads, Mixed Heads, and MLPs) each contributing positively to the correct attribute. Using Direct Logit Attribution decomposed by source tokens, the authors show constructive interference among these mechanisms, with the correct answer emerging from their combined effects rather than any single component. The work leverages a CounterFact-inspired dataset and analyzes Pythia-2.8b to characterize head types and their source-token contributions, highlighting propagation between subject and relation and the role of MLPs in boosting relation attributes. The findings have implications for mechanistic interpretability, suggesting that robust recall relies on multiple, parallel circuits and that understanding these additive pathways can inform editing, prompting, and safety considerations in language models.

Abstract

How do transformer-based large language models (LLMs) store and retrieve knowledge? We focus on the most basic form of this task -- factual recall, where the model is tasked with explicitly surfacing stored facts in prompts of the form "Fact: The Colosseum is in the country of". We find that the mechanistic story behind factual recall is more complex than previously thought. It comprises several distinct, independent, and qualitatively different mechanisms that additively combine, constructively interfering on the correct attribute. We term this generic phenomenon the additive motif: models compute through summing up multiple independent contributions. Each mechanism's contribution may be insufficient alone, but summing results in constructive interference on the correct answer. In addition, we extend the method of direct logit attribution to attribute an attention head's output to individual source tokens. We use this technique to unpack what we call "mixed heads" -- which are themselves a pair of two separate additive updates from different source tokens.
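
The per-source-token attribution mentioned in the abstract rests on the fact that an attention head's output at the final position is a weighted sum over source tokens, so its effect on a logit splits into one term per source token. The following is a minimal NumPy sketch of that idea, not the authors' implementation. It assumes a single head, ignores biases and the final layer norm, and uses random placeholder tensors; the names `dla_by_source_token`, `subject_pos`, and `relation_pos` are hypothetical.

```python
import numpy as np

def dla_by_source_token(attn, values, W_O, unembed_dir):
    """Split one head's direct logit contribution by source token.

    attn:        (n_src,) attention weights from the final position
    values:      (n_src, d_head) value vectors at each source position
    W_O:         (d_head, d_model) output projection of the head
    unembed_dir: (d_model,) unembedding direction of the attribute token
    Returns an (n_src,) array whose sum is the head's total contribution.
    Assumption: biases and the final layer norm are ignored.
    """
    # Contribution written to the residual stream by each source token.
    per_source_output = attn[:, None] * (values @ W_O)
    # Project each contribution onto the attribute's unembedding direction.
    return per_source_output @ unembed_dir

# Placeholder data (random, for illustration only).
rng = np.random.default_rng(0)
n_src, d_head, d_model = 6, 4, 8
attn = rng.random(n_src)
attn /= attn.sum()
values = rng.normal(size=(n_src, d_head))
W_O = rng.normal(size=(d_head, d_model))
unembed_dir = rng.normal(size=d_model)

contrib = dla_by_source_token(attn, values, W_O, unembed_dir)

# Hypothetical token positions for SUBJECT and RELATION in the prompt.
subject_pos, relation_pos = [2, 3], [4, 5]
subject_dla = contrib[subject_pos].sum()
relation_dla = contrib[relation_pos].sum()

# The usual, unsplit direct logit attribution of the same head.
total_dla = (attn @ values) @ W_O @ unembed_dir
```

Because every step is linear, the per-token terms sum exactly to the head's unsplit contribution, which is what allows a mixed head to be read as two separate additive updates.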

Paper Structure (36 sections, 6 equations, 19 figures, 11 tables)

Figures (19)

  • Figure 1: Four independent mechanisms that models use for factual recall: (1) subject heads, (2) relation heads, (3) mixed heads and (4) MLPs (omitted). These combine additively, constructively interfering to elicit the correct answer. Each mechanism individually is less performant than the sum of them all, with most individual mechanisms incapable of performing the task alone.
  • Figure 2: Three different types of attention head for factual extraction prompts of the form $s$ plays the sport of: subject heads, relation heads and mixed heads. (Left) DLA on the correct sport, split by attention head source token. Top 10 heads by total DLA shown. Each data point is one prompt. The grey lines have gradients $1/10$ and $10$ and denote the boundary we use to define head types, after aggregating over the relationship $r$. These cleanly separate subject and relation heads. (Right) Attention patterns of the top four heads of each kind on each prompt in the dataset. Subject and relation heads attend mostly to SUBJECT and RELATION respectively. Mixed heads attend to both. Attention patterns are not used to define head type, but correlate well with the head type.
  • Figure 3: Top heads by absolute DLA on $a$ for the relationship is in the country of. We also plot the mean DLA on the 5 largest magnitude relation attributes in $R - \{a\}$, i.e. other countries. Heads are labelled as Subject (S), Relation (R) or Mixed (M) heads. Studying a large set of counterfactual attributes and splitting by attention source token let us disentangle these head types. All three head types emerge. Subject heads are characterized by the largest column being blue -- among the tokens we study they mostly extract the correct attribute $a$ from SUBJECT. Relation heads have comparable red and purple columns, with small blue and green columns -- among the tokens we study they extract a range of relationship attributes in $R$ from RELATION. Mixed heads capture everything remaining.
  • Figure 4: Subject heads exist for a range of relations. (Top) The mechanism by which subject heads act. They read from enriched subject representations, and copy the relevant attributes to output directions. We show this for a 'sport' head and a 'country' head. Both pathways activate whenever a factual recall prompt with the given subject is presented, no matter what the stated relationship is -- they 'misfire'. No sport is extracted for Stephen Hawking. Raw data for this figure is in the Appendix in Table \ref{tab:subject_extractor_probing}. (Bottom) Top two subject heads for four different relationships. These heads individually extract the correct attribute (blue) significantly more than other relation attributes $R$ (red) and other subject attributes $S$ (green). This indicates their category $C$ is mostly narrow. L17H2 is more general, extracting many correlated facts about countries (e.g. country, currency, cities, etc.). These heads also have a high attention ratio to SUBJECT over RELATION (shown in the x-axis labels).
  • Figure 5: Relation heads exist for a range of different relationships. (Left) The top two relation heads for four different relationships. The heads extract the correct attribute (blue) about as much as they extract many other attributes in the set $R$ (red). They also have a high attention ratio to RELATION over SUBJECT (shown in the x-axis labels). (Right) Many cities are extracted by heads over a range of prompts with relation has the capital city with different subjects. The error bars denote the standard deviation over these subjects. While heads push for some cities more than others, small error bars indicate this variation is consistent across input subjects. This suggests relation head outputs do not causally depend on the subject. We include similar plots for other relationships in Appendix \ref{app:relation_heads}.
  • ...and 14 more figures
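
The head-type boundary described in the Figure 2 caption (grey lines with gradients $1/10$ and $10$) can be sketched as a simple ratio test on a head's DLA from SUBJECT versus RELATION source tokens. This is an illustrative sketch only: the caption does not specify how DLA is aggregated over the relationship $r$, so the use of absolute values of already-aggregated scores is an assumption, and `classify_head` is a hypothetical helper name.

```python
def classify_head(subject_dla, relation_dla, ratio=10.0):
    """Label a head as 'subject', 'relation' or 'mixed'.

    subject_dla / relation_dla: the head's DLA on the correct attribute
    from SUBJECT and RELATION source tokens, already aggregated over a
    relationship (aggregation method is an assumption, not from the paper).
    ratio: boundary gradient; 10 corresponds to the lines 1/10 and 10.
    """
    s, r = abs(subject_dla), abs(relation_dla)
    if s > ratio * r:
        return "subject"   # contribution dominated by SUBJECT tokens
    if r > ratio * s:
        return "relation"  # contribution dominated by RELATION tokens
    return "mixed"         # both source tokens contribute comparably

# Made-up scores, for illustration only.
labels = [
    classify_head(5.0, 0.1),
    classify_head(0.1, 5.0),
    classify_head(2.0, 1.0),
]
```

Under this rule, any head that falls between the two boundary lines is labelled mixed, which matches the caption's statement that attention patterns are not used to define head type.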