
Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models

Ruida Wang, Wangchunshu Zhou, Mrinmaya Sachan

TL;DR

The paper tackles data efficiency for small NLP models by addressing the distribution gap that arises when training on LLM-generated data. It introduces Synthesis Step by Step (S3), a dynamic framework that bootstraps with seed data synthesized via rationales and then iteratively refines the dataset: a small model trained on the current synthetic data is evaluated on a small gold validation set, and an LLM extrapolates the errors it makes into new training examples. The authors provide a theoretical analysis showing how extrapolated errors, mixed with the existing synthetic data, can recover the gold distribution, and they report strong empirical gains on IMDb, QNLI, RTE, and AdQA using a fraction of the data required by prior methods. The approach offers a practical path to deploying compact models with better data and compute efficiency, though the authors note sensitivity to prompts and task specificity as areas for further work.

Abstract

*Data Synthesis* is a promising way to train a small model with very little labeled data. One approach for data synthesis is to leverage the rich knowledge from large language models to synthesize pseudo training examples for small models, making it possible to achieve both data and compute efficiency at the same time. However, a key challenge in data synthesis is that the synthesized dataset often suffers from a large distributional discrepancy from the *real task* data distribution. Thus, in this paper, we propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap by iteratively extrapolating the errors made by a small model trained on the synthesized dataset on a small real-world validation dataset using a large language model. Extensive experiments on multiple NLP tasks show that our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data, resulting in significant improvement compared to several baselines: 9.48% improvement compared to ZeroGen and 2.73% compared to GoldGen, and at most 15.17% improvement compared to the small model trained on human-annotated data.
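The iterative loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `s3_loop`, `train_toy`, and `extrapolate_toy` are hypothetical, the "small model" is a toy word-vote classifier, and the LLM call is replaced by a stub that simply rewrites each misclassified example.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]          # (text, label)
Model = Callable[[str], int]       # maps text to a predicted label


def train_toy(data: List[Example]) -> Model:
    """Toy stand-in for the small model: each word votes +1/-1 by label."""
    weights = {}
    for text, label in data:
        for word in text.split():
            weights[word] = weights.get(word, 0) + (1 if label == 1 else -1)

    def predict(text: str) -> int:
        score = sum(weights.get(w, 0) for w in text.split())
        return 1 if score > 0 else 0

    return predict


def extrapolate_toy(errors: List[Example]) -> List[Example]:
    """Stub for the LLM step (assumption): a real system would prompt an LLM
    to write new examples resembling each misclassified one."""
    return [(text + " indeed", label) for text, label in errors]


def s3_loop(seed: List[Example], gold_val: List[Example],
            train: Callable[[List[Example]], Model],
            extrapolate: Callable[[List[Example]], List[Example]],
            rounds: int = 3) -> Tuple[List[Example], Model]:
    """Grow the synthetic set by extrapolating validation errors each round."""
    data = list(seed)
    for _ in range(rounds):
        model = train(data)
        # Collect the gold validation examples the current model gets wrong.
        errors = [(x, y) for x, y in gold_val if model(x) != y]
        if not errors:
            break
        # Mix the newly synthesized examples into the current dataset.
        data.extend(extrapolate(errors))
    return data, train(data)


seed = [("great movie", 1), ("terrible movie", 0)]
gold_val = [("wonderful film", 1), ("awful plot", 0)]
data, model = s3_loop(seed, gold_val, train_toy, extrapolate_toy)
```

In this toy run the seed-trained model misclassifies "wonderful film", the stub adds one example derived from that error, and the retrained model then classifies the validation set correctly. Only the control flow is meant to carry over to a real setup; the training routine and the LLM prompt would need to be replaced.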

Paper Structure

This paper contains 32 sections, 7 equations, 3 figures, 9 tables, 3 algorithms.

Figures (3)

  • Figure 1: Training and testing accuracy of DistilBERT trained with ZeroGen (Ye et al., 2022) on the IMDb dataset with 200k training datapoints. Also shown are the training and testing accuracy of the model trained on GoldData. ZeroGen's training accuracy quickly reaches nearly 100%, but its testing accuracy remains low.
  • Figure 2: Both (a) traditional zero-shot dataset synthesis methods and (b) training small models directly on gold data do not leverage feedback from the small model trained on the synthesized dataset. In contrast, (c) our approach, S3, first synthesizes a seed dataset in a zero-shot fashion with rationales (left-hand side). Then, we iteratively reduce the gap between the synthesized data distribution and the gold data distribution by extrapolating the errors of a small model trained on the currently synthesized data on a small gold validation set. The additional synthesized data can, therefore, be considered to be sampled from the difference between the currently synthesized data distribution and gold data distribution. By mixing it with the currently synthesized data, we can recover the gold data distribution and therefore improve the performance of a small model trained on the data mixture.
  • Figure 3: t-SNE results for QNLI (left), RTE (center), and AdQA (right) for dataset diversity analysis. ZeroGen points are plotted in yellow, S3 points in green, and gold data points in purple.
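
The mixing argument in the Figure 2 caption can be made concrete with a short worked equation. This is one possible reading under a stated assumption, not the paper's own derivation: the symbols $P_{\text{gold}}$, $P_{\text{syn}}$, $P_{\text{err}}$, and the mixing weight $\alpha$ are introduced here for illustration. Assume the gold distribution can be written as a mixture of the current synthetic distribution and a residual distribution covering what the synthetic data misses.

```latex
% Assumption: the gold distribution decomposes as a mixture, with 0 < \alpha < 1.
P_{\text{gold}} = \alpha \, P_{\text{syn}} + (1 - \alpha) \, P_{\text{err}}

% Solving for the residual gives the "difference" distribution:
P_{\text{err}} = \frac{P_{\text{gold}} - \alpha \, P_{\text{syn}}}{1 - \alpha}

% Mixing current synthetic data and error-extrapolated data
% in proportions \alpha : (1 - \alpha) then recovers the gold distribution:
\alpha \, P_{\text{syn}} + (1 - \alpha) \, P_{\text{err}} = P_{\text{gold}}
```

Under this reading, the examples the LLM extrapolates from the small model's validation errors act as approximate samples from $P_{\text{err}}$, which is why adding them to the current synthetic data moves the mixture toward the gold distribution.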