
Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG

Seungju Han, Konwoo Kim, Chanwoo Park, Benjamin Newman, Suhas Kotha, Jaehun Jung, James Zou, Yejin Choi

Abstract

Synthetic data augmentation helps language models learn new knowledge in data-constrained domains. However, naively scaling existing synthetic data methods, whether by training on more synthetic tokens or by using stronger generators, yields diminishing returns and stays below the performance of RAG. To break the RAG ceiling, we introduce Synthetic Mixed Training, which combines synthetic QAs and synthetic documents. This leverages their complementary training signals and enables log-linear improvements as both synthetic data volume and generator strength increase, allowing the model to outperform RAG by a 2.6% relative gain on QuALITY, a long-document reading comprehension benchmark. In addition, we introduce Focal Rewriting, a simple technique for synthetic document generation that explicitly conditions document generation on specific questions, improving the diversity of synthetic documents and yielding a steeper log-linear scaling curve. On QuALITY, our final recipe trains a Llama 8B model that outperforms RAG by a relative 4.4%. Across models and benchmarks (QuALITY, LongHealth, FinanceBench), our training enables models to beat RAG in five of six settings, outperforming it by 2.6%, and achieves a 9.1% gain when combined with RAG.
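
For readers who want the two quantitative notions in the abstract spelled out, one plausible reading is given below. The symbols $a_{\text{method}}$, $a_{\text{RAG}}$, $N$, $\alpha$, and $\beta$ are notation introduced here for illustration and do not come from the paper; in particular, it is an assumption that "relative gain" is measured against RAG accuracy and that "log-linear" refers to accuracy as a function of the number of synthetic training tokens $N$.

$$
\text{relative gain} = \frac{a_{\text{method}} - a_{\text{RAG}}}{a_{\text{RAG}}},
\qquad
a(N) \approx \alpha + \beta \log N
$$

Under this reading, a "steeper log-linear scaling curve" corresponds to a larger slope $\beta$, and a plateau corresponds to $a(N)$ falling below the fitted line as $N$ grows.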

Paper Structure (34 sections, 11 figures, 3 tables)

This paper contains 34 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Naively scaling synthetic data plateaus, but our simple methods allow effective scaling and surpass RAG. We evaluate four synthetic data generation strategies using both 8B and 70B generators, scaling training data up to 700M tokens. Across all four baselines, performance saturates and remains below RAG, showing that simply increasing synthetic data or compute is insufficient. In contrast, our two simple techniques, Synthetic Mixed Training and Focal Rewriting, exhibit clear log-linear scaling with both more data and a stronger generator, ultimately surpassing RAG.
  • Figure 2: (Left) Comparing the data scaling of existing methods: self-generated (8B) synthetic QAs and synthetic documents. The plot shows QuALITY accuracy as a function of the number of synthetic training tokens; shaded areas indicate the standard deviation corresponding to the 95% confidence interval, estimated from $n=8$ inference runs. We use Llama 3.1 8B Instruct for both data generation and model training. AR indicates Active Reading (Lin et al., 2025), EG indicates EntiGraph (Yang et al.), and WRAP indicates rephrasing (Maini et al., 2024). On QuALITY, synthetic QA is substantially more efficient than all existing methods that generate synthetic documents. (Right) Scaling the generator to improve synthetic token efficiency. (1) Scaling the generator to 70B does not improve synthetic token efficiency for QA, yielding only a 0.1% gain over the 8B generator at 88M tokens. (2) In contrast, document-based methods do benefit from scaling the generator, achieving a 4.5% gain on average. (3) For all methods, even with a stronger generator, data scaling plateaus.
  • Figure 3: Mixing synthetic documents does not help. Mixing different kinds of synthetic documents (blue line) provides a minimal gain over just using AR documents (pink line).
  • Figure 4: Synthetic Mixed Training breaks the RAG ceiling. We combine 70B-generated synthetic QAs and AR documents at a 1:1 ratio, attempting to achieve the best of both worlds. This mixture (sky-blue line) yields performance comparable to RAG at 350M synthetic training tokens and ultimately surpasses RAG when scaled to 700M tokens.
  • Figure 5: Synthetic Mixed Training with a mixture of domains. Here, the x-axis denotes the number of synthetic tokens grounded in the QuALITY dataset. (1) Mixing 50% synthetic QAs grounded in a different domain with 50% synthetic documents grounded in the target domain yields a better scaling curve than training solely on target-domain synthetic documents. (2) The best performance comes from mixing target-domain synthetic QAs with target-domain synthetic documents, suggesting that synthetic QAs not only teach recall behavior but also provide domain-specific knowledge.
  • ...and 6 more figures
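
The 1:1 mixture of synthetic QAs and synthetic documents described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`count_tokens`, `take_budget`, `mix_qa_and_docs`), the whitespace token count, and the choice to measure the ratio in tokens rather than in examples are all assumptions made for illustration.

```python
import random


def count_tokens(text):
    # Assumption: whitespace-separated words stand in for a real tokenizer.
    return len(text.split())


def take_budget(examples, budget, rng):
    """Draw examples in random order until the token budget is reached."""
    pool = list(examples)
    rng.shuffle(pool)
    picked, used = [], 0
    for example in pool:
        if used >= budget:
            break
        picked.append(example)
        used += count_tokens(example)
    return picked, used


def mix_qa_and_docs(qas, docs, total_tokens, qa_fraction=0.5, seed=0):
    """Build a shuffled training set from QAs and documents.

    qa_fraction=0.5 gives a 1:1 split of the token budget (hypothetical
    parameter; the source only states that the ratio is 1:1).
    """
    rng = random.Random(seed)
    qa_budget = int(total_tokens * qa_fraction)
    doc_budget = total_tokens - qa_budget
    qa_part, qa_used = take_budget(qas, qa_budget, rng)
    doc_part, doc_used = take_budget(docs, doc_budget, rng)
    mixed = qa_part + doc_part
    rng.shuffle(mixed)  # interleave the two data types within the training set
    return mixed, qa_used, doc_used
```

Calling `mix_qa_and_docs(qas, docs, total_tokens=600)` on two lists of strings returns the shuffled mixture together with the number of tokens drawn from each source, which makes it easy to check that the split is close to the intended ratio. The last example drawn from each source may overshoot its budget slightly, since examples are not truncated.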