From Parameters to Prompts: Understanding and Mitigating the Factuality Gap between Fine-Tuned LLMs
Xuan Gong, Hanbo Huang, Shiyu Liang
TL;DR
The paper addresses a factuality gap that arises when fine-tuning LLMs on known versus unknown knowledge. It combines systematic experiments with a graph-theoretic framing of knowledge graphs to explain how fine-tuning alters one-hop connectivity and how test-time prompts can modulate this effect. The authors show that in-context learning, particularly few-shot prompts and Chain-of-Thought reasoning, can mitigate or even reverse the gap in many settings, and they formalize this via a graph augmentation model where $\Delta_{\text{fact}} \propto |\mathcal{E}_{\text{kn}}| - |\mathcal{E}_{\text{unk}}| > 0$ and $\Delta_{\text{fact}}^{*} \le \Delta_{\text{fact}}$. They also demonstrate practical benefits of ICL for knowledge extraction under limited supervision and argue for including ICL effects in evaluating data-selection strategies, with implications for prompt design and model deployment in knowledge-intensive tasks.
Abstract
Factual knowledge extraction aims to explicitly extract the knowledge parameterized in pre-trained language models for use in downstream tasks. Although prior work has investigated the impact of supervised fine-tuning data on the factuality of large language models (LLMs), the underlying mechanism remains poorly understood. We revisit this impact through systematic experiments, with a particular focus on the factuality gap that arises when fine-tuning on known versus unknown knowledge. Our findings show that this gap can be mitigated at the inference stage, either under out-of-distribution (OOD) settings or by using appropriate in-context learning (ICL) prompts (i.e., few-shot learning and Chain-of-Thought (CoT)). We prove this phenomenon theoretically from the perspective of knowledge graphs, showing that the test-time prompt may diminish or even overshadow the impact of fine-tuning data and play a dominant role in knowledge extraction. Ultimately, our results shed light on the interaction between fine-tuning data and test-time prompts, demonstrating that ICL can effectively compensate for shortcomings in fine-tuning data, and highlighting the need to reconsider the use of ICL prompting as a means to evaluate the effectiveness of fine-tuning data selection methods.
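To make the test-time prompt variants named in the abstract concrete, here is a minimal sketch of how zero-shot, few-shot, and CoT prompts for a factual question might be assembled. The function name `build_prompt`, the `Q:`/`A:` template, the CoT trigger phrase, and the example questions are illustrative assumptions, not the prompt format used by the authors.

```python
from typing import Sequence, Tuple


def build_prompt(
    question: str,
    demos: Sequence[Tuple[str, str]] = (),
    cot: bool = False,
) -> str:
    """Assemble a QA prompt (hypothetical template, not the paper's).

    demos: (question, answer) pairs shown before the test question;
           an empty sequence yields a zero-shot prompt.
    cot:   if True, end the prompt with a step-by-step reasoning cue.
    """
    # Each demonstration is rendered as a completed Q/A pair.
    parts = [f"Q: {q}\nA: {a}" for q, a in demos]
    # The test question is left open for the model to complete.
    answer_cue = "A: Let's think step by step." if cot else "A:"
    parts.append(f"Q: {question}\n{answer_cue}")
    return "\n\n".join(parts)


# Illustrative demonstrations (assumed, not taken from the paper's data).
demos = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote Hamlet?", "William Shakespeare"),
]

zero_shot = build_prompt("Where was Marie Curie born?")
few_shot = build_prompt("Where was Marie Curie born?", demos)
few_shot_cot = build_prompt("Where was Marie Curie born?", demos, cot=True)
```

Under this sketch, comparing two fine-tuned models on `zero_shot` versus `few_shot` or `few_shot_cot` prompts is the kind of evaluation in which the abstract reports the factuality gap shrinking.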
