Injecting New Knowledge into Large Language Models via Supervised Fine-Tuning
Nick Mecklenburg, Yiyou Lin, Xiaoxiao Li, Daniel Holstein, Leonardo Nunes, Sara Malvar, Bruno Silva, Ranveer Chandra, Vijay Aski, Pavan Kumar Reddy Yannam, Tolga Aktas, Todd Hendry
TL;DR
This work examines how supervised fine-tuning can inject new, post-cutoff knowledge into large language models by comparing two data-generation strategies: token-based scaling (quantity-focused) and fact-based scaling (coverage-focused). Using GPT-4 with LoRA, the authors show that token-based scaling can boost Q&A accuracy but suffers from uneven fact coverage and diminishing returns at high scale, while fact-based scaling delivers more uniform knowledge ingestion and steadier gains. The study also benchmarks against RAG and analyzes cross-validation between token- and fact-based regimes, highlighting the importance of dataset design for knowledge retention. Overall, the findings argue for fact-aware dataset construction to achieve robust domain adaptation in LLMs, and they inform practical knowledge ingestion strategies for domains with post-cutoff events, such as sports.
Abstract
In recent years, Large Language Models (LLMs) have shown remarkable performance in generating human-like text, proving to be a valuable asset across various applications. However, adapting these models to incorporate new, out-of-domain knowledge remains a challenge, particularly for facts and events that occur after the model's knowledge cutoff date. This paper investigates the effectiveness of Supervised Fine-Tuning (SFT) as a method for knowledge injection in LLMs, focusing on the domain of recent sporting events. We compare two dataset generation strategies, token-based scaling and fact-based scaling, for creating training data that helps the model learn new information. Our experiments on GPT-4 demonstrate that while token-based scaling can improve Q&A accuracy, it may not provide uniform coverage of new knowledge. Fact-based scaling, on the other hand, offers a more systematic approach that ensures even coverage across all facts. We present a novel dataset generation process that leads to more effective knowledge ingestion through SFT, and our results show considerable performance improvements in Q&A tasks related to out-of-domain knowledge. This study contributes to the understanding of domain adaptation for LLMs and highlights the potential of SFT in enhancing the factuality of LLM responses in specific knowledge domains.
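
The contrast between the two scaling strategies can be illustrated with a toy sketch. This is not the authors' pipeline: the fact list, the function names (`token_based`, `fact_based`, `coverage`), the token cost per item, and the skewed sampling weights are all hypothetical, chosen only to show why a token budget alone does not control which facts get covered, whereas iterating over facts does.

```python
import random
from collections import Counter

# Hypothetical atomic facts extracted from a source document (illustrative only).
FACTS = ["fact_a", "fact_b", "fact_c", "fact_d", "fact_e"]


def token_based(facts, token_budget, tokens_per_item=20, seed=0):
    """Generate training items until a token budget is spent.

    Which fact each item covers is not controlled. Assumption: an
    unconstrained generator favors prominent facts, modeled here with
    geometrically decaying sampling weights.
    """
    rng = random.Random(seed)
    weights = [2.0 ** -i for i in range(len(facts))]
    n_items = token_budget // tokens_per_item
    return [rng.choices(facts, weights=weights)[0] for _ in range(n_items)]


def fact_based(facts, items_per_fact):
    """Generate a fixed number of training items for every fact."""
    return [fact for fact in facts for _ in range(items_per_fact)]


def coverage(items, facts):
    """Count how many training items mention each fact."""
    counts = Counter(items)
    return {fact: counts.get(fact, 0) for fact in facts}


if __name__ == "__main__":
    print("token-based:", coverage(token_based(FACTS, token_budget=1000), FACTS))
    print("fact-based: ", coverage(fact_based(FACTS, items_per_fact=10), FACTS))
```

Under these assumptions both runs produce 50 items, but only the fact-based run guarantees that every fact receives the same number of training examples; the token-based counts depend on what the generator happens to emphasize.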
