Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions
Oded Ovadia, Meni Brief, Rachel Lemberg, Eitam Sheetrit
TL;DR
Knowledge-Instruct addresses the challenge of injecting niche knowledge into LLMs from limited data by converting small corpora into dense, instruction-based training data. It uses a six-step pipeline (entity extraction, factual extraction, contextualization, deduplication, paraphrasing, and instruction conversion), followed by instruction tuning that preserves broad capabilities while the model acquires new facts. Across several datasets, including a new Companies benchmark, it demonstrates superior factual memorization, reduced catastrophic forgetting, and improved context handling, especially in retrieval-augmented and multi-hop scenarios. The approach is cost-effective, remains scalable when small models are used for data generation, and is relevant to domain-specific, knowledge-intensive applications. Its acknowledged limitations concern prompting quality and long-term retention.
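The six steps named above can be sketched as a simple data-flow. This is only an illustrative outline, not the authors' implementation: the step order comes from the summary, while the `LLM` callable interface, the task names, the exact-match deduplication rule, the output schema, and the rule-based `toy_llm` stand-in are all assumptions made so that the example runs without a real model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface (an assumption): (task_name, text) -> completion.
LLM = Callable[[str, str], str]


def build_instruction_data(
    docs: List[str], llm: LLM, n_paraphrases: int = 2
) -> List[Dict[str, str]]:
    """Turn raw documents into instruction/response pairs in six steps."""
    facts: List[Tuple[str, str]] = []
    for doc in docs:
        # Step 1: entity extraction.
        entities = [e.strip() for e in llm("entities", doc).splitlines() if e.strip()]
        for entity in entities:
            # Step 2: factual extraction, one fact per line.
            for fact in llm("facts", f"{entity}\n{doc}").splitlines():
                if not fact.strip():
                    continue
                # Step 3: contextualization, so each fact stands on its own.
                facts.append((entity, llm("contextualize", f"{entity}\n{fact}").strip()))

    # Step 4: deduplication. Exact match after normalization is an assumed,
    # simplified rule; a real system might use fuzzy or semantic matching.
    seen = set()
    unique: List[Tuple[str, str]] = []
    for entity, fact in facts:
        key = " ".join(fact.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append((entity, fact))

    examples: List[Dict[str, str]] = []
    for entity, fact in unique:
        # Step 5: paraphrasing, keeping the original wording as one variant.
        variants = [fact] + [
            llm("paraphrase", f"{i}\n{fact}").strip() for i in range(n_paraphrases)
        ]
        # Step 6: instruction conversion.
        for variant in variants:
            question = llm("question", f"{entity}\n{variant}").strip()
            examples.append({"instruction": question, "response": variant})
    return examples


def toy_llm(task: str, text: str) -> str:
    """Rule-based stand-in for a language model (illustration only)."""
    if task == "entities":
        return "Acme Corp"  # fictional entity, hard-coded for the demo
    head, _, body = text.partition("\n")
    if task == "facts":
        return "\n".join(s.strip() for s in body.split(".") if s.strip())
    if task == "contextualize":
        return body if head in body else f"{head}: {body}"
    if task == "paraphrase":
        return f"(variant {head}) {body}"
    if task == "question":
        return f"Tell me a fact about {head}."
    raise ValueError(f"unknown task: {task}")


docs = [
    "Acme Corp was founded in 2031. It makes widgets.",
    "Acme Corp was founded in 2031.",
]
examples = build_instruction_data(docs, toy_llm)
for ex in examples:
    print(ex)
```

In this toy run the repeated founding fact from the second document is dropped at the deduplication step, and each surviving fact is expanded into several instruction/response pairs. In practice `toy_llm` would be replaced by calls to an actual model, and the resulting pairs would feed a standard instruction-tuning run.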
Abstract
While Large Language Models (LLMs) acquire vast knowledge during pre-training, they often lack domain-specific, new, or niche information. Continual pre-training (CPT) attempts to address this gap but suffers from catastrophic forgetting and inefficiencies in low-data regimes. We introduce Knowledge-Instruct, a novel approach to efficiently inject knowledge from limited corpora through pure instruction-tuning. By generating information-dense synthetic instruction data, it effectively integrates new knowledge while preserving general reasoning and instruction-following abilities. Knowledge-Instruct demonstrates superior factual memorization, minimizes catastrophic forgetting, and remains scalable by leveraging synthetic data from relatively small language models. Additionally, it enhances contextual understanding, including complex multi-hop reasoning, facilitating integration with retrieval systems. We validate its effectiveness across diverse benchmarks, including Companies, a new dataset that we release to measure knowledge injection capabilities.
