Data Doping or True Intelligence? Evaluating the Transferability of Injected Knowledge in LLMs

Essa Jan, Moiz Ali, Muhammad Saram Hassan, Fareed Zaffar, Yasir Zaki

TL;DR

The paper investigates how the type of fine-tuning task affects knowledge retention in LLMs when injecting post-cutoff factual information. Comparing comprehension-based tasks (question answering, blanks) with token-mapping tasks (translation, text-to-JSON) across multiple models and sizes, it shows that comprehension tasks yield substantially higher retention, and that larger models improve retention across all task types but still struggle with semantic generalization. The authors use a GPT-4o-Mini judge to evaluate answers to direct and generic (contextual) questions and find a persistent gap between recall and transfer, which points to the limits of current fine-tuning for deep semantic integration. Practically, the work identifies task selection and cognitive engagement as key levers for effective, scalable knowledge updates in LLMs; scaling helps but is not a complete solution.

Abstract

As the knowledge of large language models (LLMs) becomes outdated over time, there is a growing need for efficient methods to update them, especially when injecting proprietary information. Our study reveals that comprehension-intensive fine-tuning tasks (e.g., question answering and blanks) achieve substantially higher knowledge retention rates (48%) than mapping-oriented tasks such as translation (17%) or text-to-JSON conversion (20%), despite exposure to identical factual content. We demonstrate that this pattern persists across model architectures and follows scaling laws, with larger models showing improved retention across all task types. However, all models exhibit significant performance drops when applying injected knowledge in broader contexts, suggesting limited semantic integration. These findings highlight the importance of task selection in updating LLM knowledge: effective knowledge injection relies not just on data exposure but on the depth of cognitive engagement during fine-tuning.
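
To make "exposure to identical factual content" across task types concrete, the following minimal sketch frames one made-up fact as four fine-tuning examples, one per task type named in the abstract. The fact itself, the prompt templates, the field names, the `build_examples` helper, and the choice of French as the translation target are all assumptions for illustration; they are not taken from the paper's dataset or code.

```python
import json

# A fictional placeholder fact (not from the paper's data).
FACT = {
    "subject": "The Zephyr Cup",
    "relation": "was won by",
    "object": "Team Orion",
    "question": "Who won the Zephyr Cup?",
    # Hypothetical reference translation; the target language is an assumption.
    "translation": "La Coupe Zephyr a été remportée par l'équipe Orion.",
}


def build_examples(fact: dict) -> dict:
    """Frame a single fact as four prompt/target pairs, one per task type."""
    sentence = f"{fact['subject']} {fact['relation']} {fact['object']}."
    return {
        # Comprehension-style tasks: the model must produce the fact itself.
        "qa": {"prompt": fact["question"], "target": fact["object"]},
        "blanks": {
            "prompt": sentence.replace(fact["object"], "____"),
            "target": fact["object"],
        },
        # Mapping-style tasks: the fact is already present in the prompt.
        "translation": {
            "prompt": f"Translate to French: {sentence}",
            "target": fact["translation"],
        },
        "text_to_json": {
            "prompt": f"Convert to JSON: {sentence}",
            "target": json.dumps(
                {k: fact[k] for k in ("subject", "relation", "object")}
            ),
        },
    }


examples = build_examples(FACT)
for task, pair in examples.items():
    print(task, "|", pair["prompt"], "->", pair["target"])
```

In this framing, the two comprehension-style examples withhold the fact from the prompt, while the two mapping-style examples contain it verbatim and only ask for a change of surface form. That contrast is one plausible reading of why the same content could be retained at different rates.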

Paper Structure

This paper contains 18 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Direct question accuracy across Qwen2.5 model sizes and tasks.
  • Figure 2: GPT-4o-Mini judge prompt.