Understanding Finetuning for Factual Knowledge Extraction

Gaurav Ghosal, Tatsunori Hashimoto, Aditi Raghunathan

TL;DR

The paper shows that finetuning a language model on well-known facts can improve downstream factuality, while finetuning mainly on lesser-known facts can suppress pretrained knowledge and reduce factual accuracy. A formal notion of factual salience, defined as $S(s,r,a)=\phi(a)^\top W_V \phi(s)$, ties how strongly a fact is stored to its susceptibility to attention imbalance, giving a mechanistic explanation in a one-layer transformer. The authors support the theory with synthetic experiments and validate it on real LLMs (Llama-2-7B, Mistral-7B) and QA benchmarks (PopQA, Entity Questions, MMLU): finetuning on the top 50% best-known facts matches or exceeds finetuning on the full dataset, while finetuning on the bottom 50% degrades downstream factuality (by 5-10% according to the abstract). The work highlights that the pretraining data distribution and the way facts are stored interact with finetuning dynamics, which offers practical guidance for data curation and suggests regularization or curriculum strategies for improving knowledge extraction in QA tasks.
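To make the salience formula above concrete, the following is a minimal sketch that evaluates $\phi(a)^\top W_V \phi(s)$ on toy vectors. The embeddings, the two value matrices, and the helper names `matvec` and `salience` are illustrative assumptions and do not come from the paper; note that the formula as written in the TL;DR does not use the relation $r$, so the sketch omits it as well.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def matvec(W: Matrix, x: Vector) -> List[float]:
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def salience(phi_a: Vector, W_V: Matrix, phi_s: Vector) -> float:
    """Bilinear form phi(a)^T W_V phi(s), as stated in the TL;DR."""
    projected = matvec(W_V, phi_s)
    return sum(a_i * v_i for a_i, v_i in zip(phi_a, projected))


# Hypothetical 2-d embeddings for one subject and one answer.
phi_s = [1.0, 0.0]
phi_a = [0.0, 1.0]

# Two hypothetical value matrices: one maps the subject strongly onto
# the answer direction, the other only weakly.
W_strong = [[0.0, 0.0], [2.0, 0.0]]
W_weak = [[0.0, 0.0], [0.1, 0.0]]

strong_score = salience(phi_a, W_strong, phi_s)
weak_score = salience(phi_a, W_weak, phi_s)
```

Under these assumptions the strongly stored fact receives the larger score, which is the sense in which salience measures how well a fact is encoded in the value matrix.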

Abstract

In this work, we study the impact of QA fine-tuning data on downstream factuality. We show that fine-tuning on lesser-known facts that are poorly stored during pretraining yields significantly worse factuality than fine-tuning on well-known facts, even when all facts are seen during pretraining. We prove this phenomenon theoretically, showing that training on lesser-known facts can lead the model to ignore subject entity names and instead output a generic plausible response even when the relevant factual knowledge is encoded in the model. On three question answering benchmarks (PopQA, Entity Questions, and MMLU) and two language models (Llama-2-7B and Mistral-7B), we find that (i) finetuning on a completely factual but lesser-known subset of the data deteriorates downstream factuality (5-10%) and (ii) finetuning on a subset of better-known examples matches or outperforms finetuning on the entire dataset. Ultimately, our results shed light on the interaction between pretrained knowledge and finetuning data and demonstrate the importance of taking into account how facts are stored in the pretrained model when fine-tuning for knowledge-intensive tasks.
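The subset comparison described in the abstract can be sketched as a simple popularity-based split. This is a hedged illustration, not the authors' pipeline: the `popularity` field, the function name `split_by_popularity`, and the fixed seed are assumptions, and the subset labels are borrowed from the figure captions further down the page.

```python
import random
from typing import Any, Dict, List


def split_by_popularity(
    examples: List[Dict[str, Any]],
    key: str = "popularity",  # hypothetical field name
    seed: int = 0,
) -> Dict[str, List[Dict[str, Any]]]:
    """Build better-known, lesser-known, random, and full finetuning subsets."""
    # Most popular examples first.
    ranked = sorted(examples, key=lambda ex: ex[key], reverse=True)
    half = len(ranked) // 2
    rng = random.Random(seed)  # seeded so the random half is reproducible
    return {
        "FT-Top": ranked[:half],                  # better-known half
        "FT-Bottom": ranked[half:],               # lesser-known half
        "FT-Random": rng.sample(ranked, half),    # random half, same size as FT-Top
        "FT-Whole": list(ranked),                 # everything
    }


# Toy usage with made-up popularity scores.
toy = [
    {"question": "q1", "popularity": 5},
    {"question": "q2", "popularity": 1},
    {"question": "q3", "popularity": 3},
    {"question": "q4", "popularity": 4},
]
subsets = split_by_popularity(toy)
```

With an odd number of examples, this sketch places the extra example in the lesser-known half; that choice is arbitrary.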

Paper Structure (37 sections, 9 theorems, 36 equations, 7 figures, 4 tables)

Key Result

Theorem 4.2

For pretraining data $D_\text{pre}$ in which every $a$ appears at least once, suppose there exists a value matrix $W_\text{V}$ satisfying the paper's mild assumptions (labeled assump:nonuniformlabel through assump:allfactmemorized in the source). Then the one-layer transformer $f(s,p_{r};W_\text{V},0)$ achieves $100\%$ accuracy under ... (the remainder of the statement is truncated in the source).

Figures (7)

  • Figure 1: Conceptual Mechanism of Finetuning on Popular versus Unpopular Knowledge. When finetuning on less popular knowledge, the model can learn to heavily upweight relation features, which enables it to make a plausible guess about the correct answer. However, training on popular, well-encoded facts discourages this imbalance. At test time, heavy reliance on relation features can result in less popular knowledge being overwritten.
  • Figure 2: Simulation Study of Finetuning for Knowledge Extraction. (a) We plot the downstream factuality of finetuning on more versus less popular facts, finding that finetuning on more popular facts improves downstream factuality. (b) We plot the difference between finetuning on FT-Top and FT-Bottom as a function of the subject Zipf parameter. We find that on increasingly long-tailed datasets, the impact of the finetuning dataset is amplified. (c) We plot the difference between finetuning on FT-Top and FT-Bottom as a function of pretraining steps, finding that the difference between the finetuning datasets is mitigated with additional training.
  • Figure 3: Analysis of Llama-7B Attention Patterns. (a) We plot the maximum attention score over subject tokens for Llama-7B models finetuned on FT-Top and FT-Bottom across layers, where the maximum attention score is averaged over the heads in each layer. All results are averaged over examples in the PopQA-Controlled test set. (b) We compare the attention patterns for a specific question between the FT-Top and FT-Bottom finetuned models. The tokens corresponding to the subject are enclosed within the green rectangle.
  • Figure 4: PopQA-Controlled Test Accuracy on Popularity Percentiles. We plot the accuracy on the top $x$ popularity percentiles of the PopQA-Controlled test set as a function of $x$. We compare the performance of finetuning on FT-Top versus FT-Bottom. We observe that while both finetuning datasets perform comparably on the most popular facts in the test set, training on the less popular data significantly underperforms on relatively less popular test questions.
  • Figure 5: Finetuning Performance on Real Datasets. We plot the factual QA accuracy across two models and question-answering datasets under different finetuning strategies. FT-Top denotes finetuning on the most popular half of the data, FT-Whole denotes finetuning on the whole training dataset, FT-Random denotes finetuning on a randomly selected half of the data, and FT-Bottom denotes finetuning on the lower 50% of the data, sorted by popularity. We plot performance restricted to the top-$x$ popularity percentiles of the test set.
  • ...and 2 more figures
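The "accuracy on the top $x$ popularity percentiles" metric used in the Figure 4 and Figure 5 captions can be sketched as follows. This is one plausible reading of the captions rather than the authors' evaluation code: the `(popularity, is_correct)` record layout, the function name, and the choice to round the cutoff up are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

Record = Tuple[float, bool]  # (popularity score, model answered correctly)


def accuracy_at_top_percentile(records: Sequence[Record], x: float) -> float:
    """Accuracy restricted to the most popular x percent of test questions."""
    if not records:
        raise ValueError("records must be non-empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in (0, 100]")
    # Most popular questions first.
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    # Keep at least one question; round the cutoff up (an assumption).
    k = max(1, math.ceil(len(ranked) * x / 100))
    kept = ranked[:k]
    return sum(1 for _, correct in kept if correct) / len(kept)


def accuracy_curve(records: Sequence[Record], percentiles: Sequence[float]) -> List[float]:
    """Evaluate the metric at several percentile cutoffs, e.g. for plotting."""
    return [accuracy_at_top_percentile(records, x) for x in percentiles]


# Toy results: the model is right on popular questions and wrong on rare ones.
toy_records = [(100.0, True), (50.0, True), (10.0, False), (1.0, False)]
curve = accuracy_curve(toy_records, [25, 50, 100])
```

In this toy example the curve falls as less popular questions are included, which is the shape the captions describe for the model finetuned on less popular data.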

Theorems & Definitions (18)

  • Definition 4.1: Fact Salience
  • Theorem 4.2: Attention imbalance can lead to hidden knowledge
  • Definition 4.3: Subject Token Relevance
  • Definition 4.4: Relation Token Relevance
  • Theorem 4.5: Factuality vs. Nonfactuality Inducing Gradients
  • Theorem 4.6: Lower bound on fact salience
  • Theorem 1.1: One-layer transformer can fully memorize the pretraining dataset
  • proof
  • Theorem 1.5: Attention imbalance can lead to hidden information
  • proof
  • ...and 8 more