Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions

Oded Ovadia, Meni Brief, Rachel Lemberg, Eitam Sheetrit

TL;DR

Knowledge-Instruct addresses the problem of injecting niche knowledge into LLMs from limited data by converting small corpora into information-dense instruction data. It presents a six-step pipeline (entity extraction, factual extraction, contextualization, deduplication, paraphrasing, and instruction conversion), followed by instruction tuning that acquires the new facts while preserving broad capabilities. Across several datasets, including a new Companies benchmark, it demonstrates superior factual memorization, reduced catastrophic forgetting, and improved context handling, especially in retrieval-augmented and multi-hop scenarios. The approach is cost-effective, scales by using relatively small models for data generation, and is relevant to domain-specific, knowledge-intensive applications; its acknowledged limitations concern prompting quality and long-term retention.
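To make the deduplication and instruction-conversion stages of the pipeline more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the template strings, the `normalize`, `deduplicate`, and `to_instructions` helpers, the string-normalization notion of a duplicate, and the example entity "Acme Corp" are all hypothetical. The earlier stages (entity and fact extraction, contextualization, paraphrasing) rely on an LLM and are not shown.

```python
import re

# Hypothetical templates for illustration only; the paper lists its own
# rule-based templates in Figure 5.
TEMPLATES = [
    "Tell me a fact about {entity}.",
    "What do you know about {entity}?",
]


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so near-identical strings match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def deduplicate(facts):
    """Keep the first occurrence of each (entity, fact) pair after normalization.

    Assumption: string normalization is a simple stand-in for whatever
    duplicate criterion the actual pipeline uses.
    """
    seen = set()
    unique = []
    for entity, fact in facts:
        key = (normalize(entity), normalize(fact))
        if key not in seen:
            seen.add(key)
            unique.append((entity, fact))
    return unique


def to_instructions(facts, templates=TEMPLATES):
    """Pair each fact with a templated prompt, cycling through the templates."""
    samples = []
    for i, (entity, fact) in enumerate(facts):
        prompt = templates[i % len(templates)].format(entity=entity)
        samples.append({"instruction": prompt, "response": fact})
    return samples


# Invented example data, not taken from the Companies dataset.
raw_facts = [
    ("Acme Corp", "Acme Corp was founded in 1999."),
    ("Acme Corp", "acme corp was founded in 1999"),
    ("Acme Corp", "Acme Corp is based in Oslo."),
]
sft_samples = to_instructions(deduplicate(raw_facts))
```

In this sketch each resulting sample is a plain instruction/response pair, the shape that typical supervised fine-tuning tooling expects.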

Abstract

While Large Language Models (LLMs) acquire vast knowledge during pre-training, they often lack domain-specific, new, or niche information. Continual pre-training (CPT) attempts to address this gap but suffers from catastrophic forgetting and inefficiencies in low-data regimes. We introduce Knowledge-Instruct, a novel approach to efficiently inject knowledge from limited corpora through pure instruction-tuning. By generating information-dense synthetic instruction data, it effectively integrates new knowledge while preserving general reasoning and instruction-following abilities. Knowledge-Instruct demonstrates superior factual memorization, minimizes catastrophic forgetting, and remains scalable by leveraging synthetic data from relatively small language models. Additionally, it enhances contextual understanding, including complex multi-hop reasoning, facilitating integration with retrieval systems. We validate its effectiveness across diverse benchmarks, including Companies, a new dataset that we release to measure knowledge injection capabilities.

Paper Structure

This paper contains 33 sections, 9 equations, 18 figures, 3 tables.

Figures (18)

  • Figure 1: Visualization of the Knowledge-Instruct framework. A small text corpus is transformed into a set of information-dense instructions following the steps outlined in the methodology subsection.
  • Figure 2: Number of facts extracted by different LLMs at various stages of the Knowledge-Instruct process. The extraction is performed in three rounds: an initial extraction followed by two iterative verification passes, where the model identifies any missed facts. The results are reported for the Companies dataset, with 'Total Facts' representing the cumulative count across all rounds and 'Unique Facts' indicating distinct, non-redundant extractions. The final model accuracy on the benchmark is provided in the legend.
  • Figure 3: A comparison of Knowledge-Instruct with Synthetic CPT. Synthetic CPT without further SFT is better at the new domain, at the expense of instruction following, and vice versa. The base model, Llama, is shown for reference.
  • Figure 4: Effect of the paraphrasing step in Knowledge-Instruct on accuracy using the Companies dataset with Llama.
  • Figure 5: A full list of all rule-based templates used to convert raw facts into SFT-compatible samples.
  • ...and 13 more figures
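
The Figure 2 caption distinguishes "Total Facts" (the cumulative count across an initial extraction and two verification passes) from "Unique Facts" (distinct, non-redundant extractions). The bookkeeping behind those two counts can be sketched as follows. This is a hedged illustration only: `extract_in_rounds` and its `extract` callable are hypothetical names, with `extract` standing in for an LLM call, and case-insensitive string matching is an assumed notion of redundancy.

```python
def extract_in_rounds(document, extract, rounds=3):
    """Run an extractor for several rounds and tally total vs. unique facts.

    `extract(document, known_facts)` is a hypothetical stand-in for an LLM
    call. In rounds after the first it receives the facts found so far and
    is expected to return any it believes were missed.
    """
    total = 0            # every fact returned, duplicates included
    unique_facts = []    # first occurrence of each distinct fact, in order
    seen = set()
    per_round = []
    for _ in range(rounds):
        new_facts = extract(document, list(unique_facts))
        total += len(new_facts)
        for fact in new_facts:
            key = fact.strip().lower()  # assumed duplicate criterion
            if key not in seen:
                seen.add(key)
                unique_facts.append(fact)
        per_round.append({"total": total, "unique": len(unique_facts)})
    return unique_facts, per_round
```

Passing the extractor in as a callable keeps the counting logic testable without a model: a fake extractor returning canned lists is enough to check the tallies.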