Establishing Knowledge Preference in Language Models

Sizhe Zhou, Sha Li, Yu Meng, Yizhu Jiao, Heng Ji, Jiawei Han

TL;DR

The paper presents a formal three-source knowledge framework for language models, introducing Instruction Knowledge, Context Knowledge, and Parametric Knowledge with a hierarchical preference (Instruction > Context > Parametric). It constructs a comprehensive benchmark (adapting IfQA, MQuAKE, MRQA) to evaluate adherence to this hierarchy and proposes a data synthesis pipeline that uses Wikipedia/Wikidata and GPT-4o to produce instruction-tuning data (HierPref). A 7B model fine-tuned with a few thousand HierPref examples achieves substantial gains across benchmarks, including robustness to noisy context and multi-hop reasoning. This work provides a practical pathway to control how LLMs utilize external and user-provided knowledge, with broad implications for retrieval-augmented generation, model editing, and user-specific QA.

Abstract

Language models are known to encode a great amount of factual knowledge through pretraining. However, such knowledge might be insufficient to cater to user requests, requiring the model to integrate external knowledge sources and adhere to user-provided specifications. When answering questions about ongoing events, the model should use recent news articles to update its response; when asked to provide recommendations, the model should prioritize user specifications over retrieved product reviews; when some facts are edited in the model, the updated facts should override all prior knowledge learned by the model even if they are conflicting. In all of the cases above, the model faces a decision between its own parametric knowledge, (retrieved) contextual knowledge, and user instruction knowledge. In this paper, we (1) unify such settings into the problem of knowledge preference and define a three-level preference hierarchy over these knowledge sources; (2) compile a collection of existing datasets IfQA, MQuAKE, and MRQA covering a combination of settings (with/without user specifications, with/without context documents) to systematically evaluate how well models obey the intended knowledge preference; and (3) propose a dataset synthesis method that composes diverse question-answer pairs with user assumptions and related context to directly fine-tune LMs for instilling the hierarchy of knowledge. We demonstrate that a 7B model, fine-tuned on only a few thousand examples automatically generated by our proposed method, effectively achieves superior performance (more than 18% improvement across all evaluation benchmarks) in adhering to the desired knowledge preference hierarchy.
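The preference order described above (instruction knowledge over context knowledge over parametric knowledge) can be illustrated with a toy resolver. This is only a sketch of the intended ordering, not the paper's method: the paper instills the hierarchy by fine-tuning a language model, whereas this example hard-codes it as a lookup. The names `PRIORITY` and `resolve`, and the placeholder answers, are hypothetical and chosen for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical source labels, ordered from highest to lowest priority,
# mirroring the hierarchy Instruction > Context > Parametric.
PRIORITY = ("instruction", "context", "parametric")


def resolve(candidates: Dict[str, Optional[str]]) -> Tuple[str, str]:
    """Return (source, answer) from the highest-priority source that offers an answer.

    `candidates` maps a source label to the answer that source supports,
    or None (or a missing key) when that source says nothing relevant.
    """
    for source in PRIORITY:
        answer = candidates.get(source)
        if answer is not None:
            return source, answer
    raise ValueError("no knowledge source provided an answer")


# Placeholder answers: the context conflicts with the parametric belief,
# and no user instruction applies, so the context answer is chosen.
print(resolve({"instruction": None, "context": "answer B", "parametric": "answer A"}))

# A user-stated assumption overrides both the context and the parametric belief.
print(resolve({"instruction": "answer C", "context": "answer B", "parametric": "answer A"}))
```

A real model receives these sources as free text rather than as pre-extracted answers, which is why the paper relies on benchmarks and synthesized training data instead of a rule like this one.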

Paper Structure (43 sections, 8 figures, 23 tables)

Figures (8)

  • Figure 1: Examples of Instruction Knowledge, Context Knowledge, and Parametric Knowledge. The conflicting parts are highlighted. The instruction knowledge and the context knowledge conflict in their timestamps; the context knowledge and the parametric knowledge conflict in their factual content.
  • Figure 2: Source Data Collection step of HierPref synthesis framework.
  • Figure 3: Modeling Preference for Instruction Knowledge step of HierPref synthesis framework. C.F. denotes Counter Factual.
  • Figure 4: Modeling Preference for Context Knowledge step of the HierPref synthesis framework. Data Synthesis for Prioritizing Instruction Knowledge (Figure 3) and Data Synthesis for Prioritizing Context Knowledge (shown here) share the same example source data from Figure 2. In the actual implementation, the source data of the two stages do not overlap.
  • Figure 5: Evaluation scores on the full split of the IfQA test set. G denotes training data drawn from IfQA's train set; S denotes training data drawn from the HierPref synthesized single-hop QA set. The number preceding G or S gives the size of the training data used.
  • ...and 3 more figures