Injecting New Knowledge into Large Language Models via Supervised Fine-Tuning

Nick Mecklenburg, Yiyou Lin, Xiaoxiao Li, Daniel Holstein, Leonardo Nunes, Sara Malvar, Bruno Silva, Ranveer Chandra, Vijay Aski, Pavan Kumar Reddy Yannam, Tolga Aktas, Todd Hendry

TL;DR

This work probes how supervised fine-tuning can inject new, post-cutoff knowledge into large language models by comparing two data-generation strategies: token-based scaling (quantity-focused) and fact-based scaling (coverage-focused). Using GPT-4 with LoRA, the authors show that token-based scaling can boost Q&A accuracy but suffers from uneven fact coverage and diminishing returns at high scale, while fact-based scaling delivers more uniform knowledge ingestion and steadier gains. The study also benchmarks against RAG and analyzes cross-validation between token- and fact-based regimes, highlighting the importance of dataset design for knowledge retention. Overall, the findings advocate for fact-aware dataset construction to achieve robust domain adaptation in LLMs and inform practical knowledge ingestion strategies for post-cutoff domains like sports.

Abstract

In recent years, Large Language Models (LLMs) have shown remarkable performance in generating human-like text, proving to be a valuable asset across various applications. However, adapting these models to incorporate new, out-of-domain knowledge remains a challenge, particularly for facts and events that occur after the model's knowledge cutoff date. This paper investigates the effectiveness of Supervised Fine-Tuning (SFT) as a method for knowledge injection in LLMs, specifically focusing on the domain of recent sporting events. We compare different dataset generation strategies -- token-based and fact-based scaling -- to create training data that helps the model learn new information. Our experiments on GPT-4 demonstrate that while token-based scaling can lead to improvements in Q&A accuracy, it may not provide uniform coverage of new knowledge. Fact-based scaling, on the other hand, offers a more systematic approach to ensure even coverage across all facts. We present a novel dataset generation process that leads to more effective knowledge ingestion through SFT, and our results show considerable performance improvements in Q&A tasks related to out-of-domain knowledge. This study contributes to the understanding of domain adaptation for LLMs and highlights the potential of SFT in enhancing the factuality of LLM responses in specific knowledge domains.
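
The contrast between the two scaling strategies can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' generation pipeline: it assumes token-based scaling draws Q&A pairs about arbitrary facts until a token budget is met, while fact-based scaling produces a fixed number of pairs per fact. The names `token_scaled`, `fact_scaled`, `tokens_per_pair`, and the fact list are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

def token_scaled(facts, token_budget, tokens_per_pair=20, seed=0):
    """Toy model of token-based scaling (assumption): keep adding Q&A pairs
    about randomly chosen facts until the token budget is reached."""
    rng = random.Random(seed)
    counts = Counter({fact: 0 for fact in facts})
    used = 0
    while used < token_budget:
        counts[rng.choice(facts)] += 1
        used += tokens_per_pair
    return counts

def fact_scaled(facts, pairs_per_fact):
    """Toy model of fact-based scaling (assumption): every fact receives
    exactly the same number of Q&A pairs."""
    return Counter({fact: pairs_per_fact for fact in facts})

# Hypothetical setup: 50 facts, and both strategies produce 250 pairs in total.
facts = [f"fact_{i}" for i in range(50)]
token_counts = token_scaled(facts, token_budget=50 * 5 * 20)
fact_counts = fact_scaled(facts, pairs_per_fact=5)

# Compare how evenly each strategy spreads its pairs over the facts.
for name, counts in [("token-based", token_counts), ("fact-based", fact_counts)]:
    values = counts.values()
    print(f"{name}: total={sum(values)}, min={min(values)}, max={max(values)}")
```

With the same total number of pairs, the fact-based counts are uniform by construction, whereas the token-based counts depend on which facts happen to be drawn, so some facts may be covered many times and others rarely or not at all.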

Paper Structure (17 sections, 7 figures, 1 table)

Figures (7)

  • Figure 1: Token-based evaluation set accuracy for our six documents across 1x, 5x, and 10x scaling with models trained on token-scaled datasets. The base model results with no training are included under the bars annotated as "original," and we include a RAG baseline as well which leverages the cleaned document sections to answer the eval questions.
  • Figure 2: Fact coverage across token-based datasets.
  • Figure 3: Fact-based evaluation set accuracy for our six documents across 1x, 5x, and 10x scaling with models trained on fact-scaled datasets. The base model results with no training are included under the bars annotated as "original," and we include a RAG baseline as well which leverages the cleaned document sections to answer the eval questions.
  • Figure 4: Fact-based evaluation set accuracy for our six documents across 1x, 5x, and 10x scaling with models trained on token-scaled datasets.
  • Figure 5: Fact scaling for 3 and 6 epochs. Note the anomaly in the Cricket 5x, 6-epoch configuration: because the learning rate was not optimized and our implementation selects only the last model checkpoint, the final model falls into an extremely low training-loss regime that leads to overfitting. In the random optimization over the 10x dataset, this does not occur.
  • ...and 2 more figures