SPA: A Simple but Tough-to-Beat Baseline for Knowledge Injection

Kexian Tang, Jiani Wang, Shaowen Wang, Kaifeng Lyu

Abstract

While large language models (LLMs) are pretrained on massive amounts of data, their knowledge coverage remains incomplete in specialized, data-scarce domains, motivating extensive efforts to study synthetic data generation for knowledge injection. We propose SPA (Scaling Prompt-engineered Augmentation), a simple but tough-to-beat baseline that uses a small set of carefully designed prompts to generate large-scale synthetic data for knowledge injection. Through systematic comparisons, we find that SPA outperforms several strong baselines. Furthermore, we identify two key limitations of prior approaches: (1) while RL-based methods may improve the token efficiency of LLM-based data augmentation at small scale, they suffer from diversity collapse as data scales, leading to diminishing returns; and (2) while multi-stage prompting may outperform simple augmentation methods, its advantages can disappear after careful prompt tuning. Our results suggest that, for knowledge injection, careful prompt design combined with straightforward large-scale augmentation can be surprisingly effective, and we hope SPA can serve as a strong baseline for future studies in this area. Our code is available at https://github.com/Tangkexian/SPA.

Paper Structure (75 sections, 4 figures, 8 tables)

Figures (4)

  • Figure 1: Overview of Scaling Prompt-engineered Augmentation (SPA). Our baseline method rewrites a small source corpus into a large synthetic corpus by repeatedly prompting a generator with a fixed set of seven human-curated prompt templates, which are designed based on three levels of learning strategies: Concept Learning, Critical Thinking, and Generative Learning.
  • Figure 2: Scaling Curve on SQuAD shows that SPA exhibits strong and consistent scaling behavior. The y-axis represents QA accuracy, and the x-axis represents the synthetic token budget. Note: the PaST data point corresponds to the best performance reported in the original paper.
  • Figure 3: Scaling Curve on QuALITY shows that SPA achieves the strongest scaling performance among compared methods as synthetic data scales. The y-axis represents QA accuracy, and the x-axis represents the synthetic token budget. Note: for EntiGraph, we use statistics from the original paper, where GPT-4-Turbo is used as the generator, whereas SPA and Active Reading use gpt-oss-120b. The SoG data point corresponds to the best performance reported in the original paper, which uses a stronger base model, Llama-3.1-8B-Instruct.
  • Figure 4: Document-level comparison shows that SPA achieves higher average strategy effectiveness than Active Reading on QuALITY. The table reports the average accuracy (%) of each method across all strategies for each document, showing that SPA consistently attains higher mean accuracy than Active Reading. Bold numbers indicate cases where SPA outperforms Active Reading. The subplots visualize the accuracy of individual strategies for each document, including seven strategies for SPA and a variable number for Active Reading. In each subplot, wide bars denote the average accuracy across all strategies within each method, and the narrow bars denote the accuracy of individual strategies. The gray dashed line denotes the base model's accuracy.
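
The augmentation loop described in the Figure 1 caption (repeatedly prompting a generator with a fixed set of seven templates until a synthetic token budget is reached) can be illustrated with a minimal sketch. This is not the authors' implementation: the template texts are placeholders, since their actual wording is not given here, and the names `augment`, `count_tokens`, `max_calls`, and the `generate` callable are assumptions made for illustration. Whitespace splitting stands in for a real tokenizer.

```python
from itertools import product
from typing import Callable, List, Sequence

# Hypothetical placeholders: the source only states that there are seven
# human-curated templates; their real wording is not reproduced here.
PROMPT_TEMPLATES: List[str] = [
    f"[placeholder template {i}] Rewrite the passage below.\n\n{{document}}"
    for i in range(1, 8)
]


def count_tokens(text: str) -> int:
    """Assumption: whitespace-separated words approximate the token count."""
    return len(text.split())


def augment(
    documents: Sequence[str],
    generate: Callable[[str], str],
    token_budget: int,
    max_calls: int = 10_000,
    templates: Sequence[str] = PROMPT_TEMPLATES,
) -> List[str]:
    """Cycle over (document, template) pairs until the budget is met.

    `generate` is any function mapping a prompt to generated text. In
    practice it would sample from an LLM, so repeating the same prompt
    yields different outputs; here it is left abstract.
    `max_calls` is a safety cap in case the generator returns empty text.
    """
    pairs = list(product(documents, templates))
    synthetic: List[str] = []
    used = 0
    if not pairs:
        return synthetic
    for call in range(max_calls):
        if used >= token_budget:
            break
        doc, template = pairs[call % len(pairs)]
        sample = generate(template.format(document=doc))
        synthetic.append(sample)
        used += count_tokens(sample)
    return synthetic
```

To try the sketch, pass any callable as `generate`, for example a stub that returns a fixed string, and vary `token_budget` to see how the number of generated samples grows with the budget. A real setup would replace the stub with a call to the generator model and the whitespace count with that model's tokenizer.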