From Parameters to Prompts: Understanding and Mitigating the Factuality Gap between Fine-Tuned LLMs

Xuan Gong, Hanbo Huang, Shiyu Liang

TL;DR

The paper addresses a factuality gap that arises when fine-tuning LLMs on known versus unknown knowledge. It combines systematic experiments with a graph-theoretic framing of knowledge graphs to explain how fine-tuning alters one-hop connectivity and how test-time prompts can modulate this effect. The authors show that in-context learning, particularly few-shot prompts and Chain-of-Thought reasoning, can mitigate or even reverse the gap in many settings, and they formalize this via a graph augmentation model where $\Delta_{\text{fact}} \propto |\mathcal{E}_{\text{kn}}| - |\mathcal{E}_{\text{unk}}| > 0$ and $\Delta_{\text{fact}}^* \le \Delta_{\text{fact}}$. They also demonstrate practical benefits of ICL for knowledge extraction under limited supervision and argue for including ICL effects in evaluating data-selection strategies, with implications for prompt design and model deployment in knowledge-intensive tasks.

Abstract

Factual knowledge extraction aims to explicitly extract knowledge parameterized in pre-trained language models for application in downstream tasks. While prior work has investigated the impact of supervised fine-tuning data on the factuality of large language models (LLMs), its mechanism remains poorly understood. We revisit this impact through systematic experiments, with a particular focus on the factuality gap that arises when fine-tuning on known versus unknown knowledge. Our findings show that this gap can be mitigated at the inference stage, either under out-of-distribution (OOD) settings or by using appropriate in-context learning (ICL) prompts (i.e., few-shot learning and Chain-of-Thought (CoT)). We prove this phenomenon theoretically from the perspective of knowledge graphs, showing that the test-time prompt may diminish or even overshadow the impact of fine-tuning data and play a dominant role in knowledge extraction. Ultimately, our results shed light on the interaction between fine-tuning data and test-time prompts, demonstrating that ICL can effectively compensate for shortcomings in fine-tuning data, and highlighting the need to reconsider the use of ICL prompting as a means to evaluate the effectiveness of fine-tuning data selection methods.
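
To make the notion of a few-shot or CoT test-time prompt concrete, the following is a minimal sketch of how such a prompt might be assembled from (subject, answer) demonstrations. The function name `build_few_shot_prompt`, the `Q:`/`A:` layout, the question template, and the CoT trigger phrase are all illustrative assumptions; the paper's actual prompt format is not given in this summary.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    relation_template: str,
    demonstrations: List[Tuple[str, str]],
    query_subject: str,
    use_cot: bool = False,
) -> str:
    """Assemble a test-time prompt from (subject, answer) demonstrations.

    The template and the Q/A layout are hypothetical, chosen only for
    illustration; they are not the format used by the paper's authors.
    """
    lines = []
    for subject, answer in demonstrations:
        lines.append(f"Q: {relation_template.format(subject=subject)}")
        lines.append(f"A: {answer}")
    # The final question is the actual query; the model completes the answer.
    lines.append(f"Q: {relation_template.format(subject=query_subject)}")
    # Assumed CoT variant: ask the model to reason before answering.
    lines.append("A: Let's think step by step." if use_cot else "A:")
    return "\n".join(lines)


prompt = build_few_shot_prompt(
    "What is the capital of {subject}?",
    [("France", "Paris"), ("Japan", "Tokyo")],
    "Canada",
)
cot_prompt = build_few_shot_prompt(
    "What is the capital of {subject}?",
    [("France", "Paris")],
    "Canada",
    use_cot=True,
)
print(prompt)
```

Passing an empty demonstration list yields a zero-shot prompt, which gives a simple baseline to compare against when measuring how much the demonstrations change the answers.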

Paper Structure

This paper contains 43 sections, 4 theorems, 38 equations, 6 figures, 7 tables.

Key Result

Lemma 4.1

Let $\mathcal{D}_r$ be the training dataset for relation $r$. For a knowledge triple $k = (s, r, a) \in \mathcal{D}_{r}$, let $\mathcal{G}_s = (\mathcal{V}_s, \mathcal{E}_s^{\text{sim}})$ and $\mathcal{G}_a = (\mathcal{V}_a, \mathcal{E}_a^{\text{sim}})$ be the subgraphs connected to $s$ and $a$ via

Figures (6)

  • Figure 1: Overview: In-context learning (ICL) prompts can help reduce the factuality gap. Demonstrations such as $(s', a')$ enhance the connectivity of the FT-Unknown LLM's graph, which narrows the gap. FT-Unknown LLM and FT-Known LLM refer to LLMs fine-tuned on unknown and known knowledge, respectively.
  • Figure 2: Memorizing a known knowledge triple $(s_0, r, a_0)$ generalizes to memorizing $(s_1, r, a_1)$, but memorizing an unknown knowledge triple $(s_2, r, a_2)$ cannot generalize.
  • Figure 3: Ablation study of few-shot examples and CoT.
  • Figure 4: Ablation study of prompt formulation. We use three levels of rephrasing: Minor, Moderate, Radical.
  • Figure 5: In an LLM fine-tuned on unknown knowledge (left), the demonstration $(s', r, a')$ introduces new edges $(s_0, a_0)$ and $(s_2, a_2)$. In contrast, for the LLM fine-tuned on known knowledge (right), these edges already exist and thus are not newly added. Consequently, the factuality gap narrows as the difference in the number of edges between the two graphs decreases.
  • ...and 1 more figure
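
The edge-counting argument in the Figure 5 caption can be illustrated with a toy calculation. The sketch below treats each model's knowledge graph as a set of (subject, answer) edges and uses the edge-count difference $|\mathcal{E}_{\text{kn}}| - |\mathcal{E}_{\text{unk}}|$ as a stand-in for the factuality gap. The specific edge sets, the helper names `connectivity_gap` and `augment`, and the assumption that the FT-Unknown edges are a subset of the FT-Known edges are all made up for illustration; this is not the paper's formal model.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]


def connectivity_gap(edges_known: Set[Edge], edges_unknown: Set[Edge]) -> int:
    """Edge-count difference |E_kn| - |E_unk|, a proxy for the factuality gap."""
    return len(edges_known) - len(edges_unknown)


def augment(edges: Set[Edge], demo_edges: Set[Edge]) -> Set[Edge]:
    """Add the edges induced by a demonstration; existing edges are unaffected."""
    return edges | demo_edges


# Toy graphs (hypothetical): the FT-Known model already connects all three
# pairs, while the FT-Unknown model connects only one of them.
edges_known = {("s0", "a0"), ("s1", "a1"), ("s2", "a2")}
edges_unknown = {("s1", "a1")}

# Edges assumed to be induced by a demonstration (s', a'), as in Figure 5.
demo_edges = {("s0", "a0"), ("s2", "a2")}

gap_before = connectivity_gap(edges_known, edges_unknown)
gap_after = connectivity_gap(
    augment(edges_known, demo_edges),
    augment(edges_unknown, demo_edges),
)
print(gap_before, gap_after)  # prints "2 0"
```

In this toy setting the demonstration adds two new edges to the FT-Unknown graph and none to the FT-Known graph, so the gap shrinks from 2 to 0. The inequality $\Delta_{\text{fact}}^* \le \Delta_{\text{fact}}$ holds here because the FT-Unknown edges are assumed to be a subset of the FT-Known edges; without that assumption the set-union toy model does not guarantee it.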

Theorems & Definitions (14)

  • Definition 3.1: Unknown Knowledge
  • Definition 3.2: Known Knowledge
  • Lemma 4.1: Memorizing Knowledge as an Edge-Completion Process
  • Remark 1
  • Theorem 4.1: Factuality Gap as a Connectivity Gap in Knowledge Graphs
  • Remark 2
  • Theorem 4.2: Decay of Factuality Gap Under Distributional Shift
  • Remark 3
  • Theorem 5.1: ICL Prompt Can Mitigate the Factuality Gap
  • Remark 4
  • ...and 4 more