Understanding Fact Recall in Language Models: Why Two-Stage Training Encourages Memorization but Mixed Training Teaches Knowledge

Ying Zhang, Benjamin Heinzerling, Dongyuan Li, Ryoma Ishigaki, Yuta Hitomi, Kentaro Inui

TL;DR

This paper investigates why two-stage training promotes memorization while mixed BIO+QA training teaches generalizable fact recall in language models. It introduces the cross-task gradient trace to identify shared parameters influenced by both fact-storing and fact-recalling data, and demonstrates that mixed training yields a larger and more centralized set of shared parameters than two-stage training. These shared parameters concentrate in critical attention heads and a subset of MLP neurons, forming a fact-recall circuit; targeted interventions (ablation, grafting, and circuit analysis) reveal their pivotal role in cross-form recall and parameter-efficient generalization. The findings illuminate the internal mechanisms of cross-task learning and offer a model-agnostic tool for interpretable analysis of fact recall, with implications for designing training strategies that teach knowledge rather than encourage memorization.

Abstract

Fact recall, the ability of language models (LMs) to retrieve specific factual knowledge, remains a challenging task despite their impressive general capabilities. Common training strategies often struggle to promote robust recall behavior: two-stage training, which first trains a model on fact-storing examples (e.g., factual statements) and then on fact-recalling examples (question-answer pairs), tends to encourage rote memorization rather than generalizable fact retrieval. In contrast, mixed training, which jointly uses both types of examples, has been empirically shown to improve fact recall, but the underlying mechanisms are still poorly understood. In this work, we investigate how these training strategies shape model parameters during training and how the resulting differences relate to the models' ability to recall facts. We introduce the cross-task gradient trace to identify shared parameters, those strongly influenced by both fact-storing and fact-recalling examples. Our analysis on synthetic fact recall datasets with the Llama-3.2B and Pythia-2.8B models reveals that mixed training encourages a larger and more centralized set of shared parameters. These findings suggest that the emergence of shared parameters may play a key role in enabling LMs to generalize factual knowledge across task formulations.
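
The idea of shared parameters can be illustrated with a small sketch. The abstract only says that shared parameters are those strongly influenced by both fact-storing and fact-recalling examples; it does not give the selection rule. The version below therefore rests on assumptions: influence is measured as the accumulated absolute gradient per parameter, "strongly influenced" means being in the top k for a task, and the function names (`accumulate_abs_grads`, `top_k_indices`, `shared_parameters`) and toy gradient values are hypothetical.

```python
from typing import List, Set


def accumulate_abs_grads(grad_steps: List[List[float]]) -> List[float]:
    """Sum |gradient| per parameter over all recorded backward passes."""
    totals = [0.0] * len(grad_steps[0])
    for grads in grad_steps:
        for i, g in enumerate(grads):
            totals[i] += abs(g)
    return totals


def top_k_indices(scores: List[float], k: int) -> Set[int]:
    """Indices of the k parameters with the largest accumulated score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def shared_parameters(
    bio_steps: List[List[float]], qa_steps: List[List[float]], k: int
) -> Set[int]:
    """Parameters in the top-k set for BOTH tasks (assumed selection rule)."""
    bio_top = top_k_indices(accumulate_abs_grads(bio_steps), k)
    qa_top = top_k_indices(accumulate_abs_grads(qa_steps), k)
    return bio_top & qa_top


# Toy gradients for a 4-parameter model, two backward passes per task.
bio_steps = [[0.9, 0.1, 0.5, 0.0], [0.8, 0.0, 0.6, 0.1]]  # fact-storing (BIO)
qa_steps = [[0.7, 0.6, 0.0, 0.1], [0.9, 0.5, 0.1, 0.0]]   # fact-recalling (QA)

shared = shared_parameters(bio_steps, qa_steps, k=2)
print(sorted(shared))  # parameter 0 is in the top-2 set for both tasks
```

In this toy setting, a training strategy that produces more shared parameters would show up as a larger intersection for the same k.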

Paper Structure

This paper contains 31 sections, 9 equations, 19 figures, 4 tables.

Figures (19)

  • Figure 1: (a) Performance of fine-tuned Llama and Pythia on the QA out-of-distribution set. The Mix-tuned model substantially outperforms the Stage-tuned model, demonstrating superior generalization in the fact recall task. (b) Overview of the proposed tool. The fine-tuned model ($\theta^{\text{task}}$) first performs a forward pass to compute the task-specific loss. During the backward pass (i.e., backpropagation), we track gradients for each parameter ($\theta$) and identify shared parameters, those strongly influenced by both the BIO and QA tasks. After fine-tuning, we apply grafting to locate fact recall-related parameters and perform ablation to evaluate the role of shared parameters in this subset.
  • Figure 2: Shared parameters ($\mathcal{S}$) in Llama: distribution and impact. (a) Mixed training yields more shared parameters than two-stage training. (b) Mix-tuned models show a larger accuracy drop after ablation, demonstrating their impact.
  • Figure 3: (a) Grafting procedure. (b) Number of fact recall-related parameters ($|\gamma|_{0}$) in grafted models. Mix-tuned models include fewer fact recall-related parameters than Stage-tuned models.
  • Figure 4: Intervention results on fine-grained attention heads. The Shared Size metric most effectively identifies minimal sufficient heads in (a) and most critical heads in (b) for fact recall circuits. (c) shows the corresponding fraction of intervened parameters.
  • Figure 5: Attention pattern of grafted Mix-tuned Llama in Layer 21, Head 17. Left: attention pattern and output logits for the BIO input: Alexandra Leblanc was welcomed into life on April 11, 1982... Right: attention pattern and logits for the QA input about the same individual, whose QA is out-of-distribution. The model successfully recalls the attribute Andrew Jackson University. Colored rectangles highlight consistent attention behaviors across BIO and QA (e.g., purple = subject linking, blue = relation focusing). Output logits are computed by mapping the output states of the final token to the vocabulary space. A zoomed-in view is shown in the paper's appendix on attention patterns.
  • ...and 14 more figures
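
The grafting and ablation steps named in the captions for Figures 1 to 3 can be sketched as follows. The captions do not spell out either procedure, so this is a minimal illustration under stated assumptions: grafting is modeled as a binary mask $\gamma$ that copies fine-tuned values into the pretrained model, $|\gamma|_{0}$ is the number of grafted entries, and ablation removes the shared parameters from the mask so they fall back to their pretrained values. The helper names (`graft`, `l0`, `ablate`) and all numbers are hypothetical.

```python
from typing import List, Set


def graft(pretrained: List[float], finetuned: List[float], mask: List[bool]) -> List[float]:
    """Take the fine-tuned value where mask is True, else the pretrained value."""
    return [ft if m else pre for pre, ft, m in zip(pretrained, finetuned, mask)]


def l0(mask: List[bool]) -> int:
    """Number of grafted parameters, i.e. the L0 norm of the binary mask."""
    return sum(1 for m in mask if m)


def ablate(mask: List[bool], shared: Set[int]) -> List[bool]:
    """Drop shared parameters from the graft mask (assumed ablation rule)."""
    return [m and (i not in shared) for i, m in enumerate(mask)]


# Toy 4-parameter model.
pretrained = [0.0, 0.0, 0.0, 0.0]
finetuned = [1.0, 2.0, 3.0, 4.0]
mask = [True, False, True, True]  # hypothetical fact recall-related subset
shared = {0}                      # hypothetical shared parameter indices

grafted = graft(pretrained, finetuned, mask)
ablated = graft(pretrained, finetuned, ablate(mask, shared))

print(grafted, l0(mask))  # grafted model and its mask size
print(ablated)            # same graft with the shared parameter reverted
```

Under these assumptions, the accuracy gap between the grafted and ablated models on a recall benchmark would be the quantity of interest; the evaluation itself is omitted here because it depends on the model and dataset.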

Theorems & Definitions (1)

  • Definition 3.1: Shared Parameters