SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation

Abhishek Divekar, Greg Durrett

TL;DR

This work proposes Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval augmentation to introduce variety into the dataset synthesis process: as retrieved passages vary, the LLM is seeded with different content to generate its examples.

Abstract

It is often desirable to distill the capabilities of large language models (LLMs) into smaller student models due to compute and memory constraints. One way to do this for classification tasks is via dataset synthesis, which can be accomplished by generating examples of each label from the LLM. Prior approaches to synthesis use few-shot prompting, which relies on the LLM's parametric knowledge to generate usable examples. However, this leads to issues of repetition, bias towards popular entities, and stylistic differences from human text. In this work, we propose Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval augmentation to introduce variety into the dataset synthesis process: as retrieved passages vary, the LLM is seeded with different content to generate its examples. We empirically study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor, requiring complex synthesis strategies. We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance, when compared to 32-shot prompting and four prior approaches. We release our code to perform all steps at https://github.com/amazon-science/synthesizrr.
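
The retrieve-then-refine idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' released implementation: the token-overlap retriever, the prompt wording, and every name here (`retrieve`, `build_prompt`, `synthesize`, the `teacher` callable) are assumptions made for illustration. The point it shows is that each seed example is paired with several different retrieved documents, so the teacher LLM receives different content for every synthetic example.

```python
from typing import Callable, Dict, List, Tuple


def tokenize(text: str) -> set:
    """Lowercased whitespace tokens (a stand-in for a real retriever's analyzer)."""
    return set(text.lower().split())


def retrieve(query: str, corpus: List[str], k: int) -> List[str]:
    """Return up to k distinct documents, ranked by token overlap with the query.

    Hypothetical toy retriever; a real system would use a sparse or dense index.
    """
    q = tokenize(query)
    unique_docs = list(dict.fromkeys(corpus))  # drop duplicates, keep order
    ranked = sorted(unique_docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(instruction: str, document: str, label_text: str) -> str:
    """Combine the instruction, one retrieved document, and the verbalized label."""
    return (
        f"{instruction}\n\n"
        f"Document: {document}\n\n"
        f"Target label: {label_text}\n\n"
        f"Rewritten text:"
    )


def synthesize(
    seeds: List[Tuple[str, str]],
    corpus: List[str],
    verbalizer: Dict[str, str],
    instruction: str,
    teacher: Callable[[str], str],
    k: int,
) -> List[Tuple[str, str]]:
    """For each (text, label) seed, retrieve k documents and refine each one
    into a synthetic example that inherits the seed's label."""
    dataset = []
    for seed_text, label in seeds:
        for doc in retrieve(seed_text, corpus, k):
            prompt = build_prompt(instruction, doc, verbalizer[label])
            dataset.append((teacher(prompt), label))
    return dataset
```

In this sketch `teacher` is any function from a prompt string to generated text, so a stub can stand in for an LLM call during testing. With `k` retrieved documents per seed, a seed set of size `n` yields at most `n * k` synthetic examples.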

Paper Structure

This paper contains 40 sections, 2 equations, 8 figures, 27 tables, 1 algorithm.

Figures (8)

  • Figure 1: Synthetic examples from few-shot generation (middle) and SynthesizRR (bottom). Our approach incorporates a content sourcing step which retrieves documents from a corpus: for the task of detecting political bias, a news article is retrieved and the teacher LLM is prompted to produce a biased version. The resulting synthesis procedure yields diverse examples which more closely match human-written examples.
  • Figure 2: Abstract depiction of the SynthesizRR procedure. In the content sourcing stage, we retrieve $K$ unique documents $\{ r_1, \dots, r_K\}$ from a large corpus for each in-context covariate $x_{\text{ICL}}$. The task-inversion stage of synthesis uses a parameterized context refinement prompt $\mathcal{P}_{\tau}$, which takes parameters $R_{inv}$ (inversion instruction), $r_k$ (a retrieved document), and $\mathcal{V}(y_{\text{ICL}})$ (the verbalized target label). A generalist teacher LLM autoregressively generates a synthetic covariate. Each in-context example thus produces $K$ unique synthetic examples $\{\tilde{x}_1, \dots, \tilde{x}_K\}$, which we include in the dataset with target $y_{\text{ICL}}$.
  • Figure 3: Self-BLEU $(\downarrow)$ for ngrams n=1-5. Comparison: Gold, FewGen 0-shot, FewGen 32-shot, SynthesizRR 0-shot, SynthesizRR 3-shot RetrICL, SynthesizRR 32-shot Non-RetrICL.
  • Figure 4: Entity entropy $(\uparrow)$ on ToI (headlines) and Category (reviews). Comparison: Gold, FewGen 32-shot, SynthesizRR 3-shot RetrICL and SynthesizRR 32-shot Non-RetrICL. Zero-shot results are similar for SynthesizRR and worse for FewGen; we omit them.
  • Figure 5: Data maps from a DistilBERT training run on $8$K Category rows from LLaMa2. FewGen (center) is skewed towards easy-to-learn examples (top-left) while Gold (left) and SynthesizRR (right) have a higher density of ambiguous examples.
  • ...and 3 more figures
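
As a rough illustration of the entity entropy metric named in the Figure 4 caption, the sketch below computes the Shannon entropy of an entity frequency distribution in Python. The exact definition used in the paper (for example, which entity extractor is used and whether entropy is computed per entity type) is not given in this excerpt, so the function name `entity_entropy` and the assumption that entity mentions have already been extracted into a flat list are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def entity_entropy(entities: Iterable[str]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of entity mentions.

    Assumes the mentions were extracted beforehand, e.g. by an NER tool.
    Higher values mean mentions are spread over more distinct entities.
    """
    counts = Counter(entities)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no mentions: define entropy as zero
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under this definition, a dataset that keeps mentioning the same few popular entities scores low, while one whose mentions are spread evenly over many entities scores high, which matches the $(\uparrow)$ direction in the caption.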