
Balancing Cost and Effectiveness of Synthetic Data Generation Strategies for LLMs

Yung-Chieh Chan, George Pu, Apaar Shanker, Parth Suresh, Penn Jenks, John Heyer, Sam Denton

TL;DR

A practical framework for selecting the appropriate augmentation method across settings is provided, taking into account additional factors such as the scalability of each method, the importance of verifying synthetic data, and the use of different LLMs for synthetic data generation.

Abstract

As large language models (LLMs) are applied to more use cases, creating high-quality, task-specific datasets for fine-tuning becomes a bottleneck for model improvement. Using high-quality human data has been the most common approach to unlock model performance, but is prohibitively expensive in many scenarios. Several alternative methods have also emerged, such as generating synthetic or hybrid data, but the effectiveness of these approaches remains unclear, especially in resource-constrained scenarios and tasks that are not easily verified. To investigate this, we group various synthetic data generation strategies into three representative categories -- Answer Augmentation, Question Rephrase, and New Question -- and study the performance of student LLMs trained under various constraints, namely seed instruction set size and query budget. We demonstrate that these strategies are not equally effective across settings. Notably, the optimal data generation strategy depends strongly on the ratio between the available teacher query budget and the size of the seed instruction set. When this ratio is low, generating new answers to existing questions proves most effective, but as this ratio increases, generating new questions becomes optimal. Across all tasks, we find that the choice of augmentation method and other design choices matter substantially more in low- to mid-data regimes than in high-data regimes. We provide a practical framework for selecting the appropriate augmentation method across settings, taking into account additional factors such as the scalability of each method, the importance of verifying synthetic data, and the use of different LLMs for synthetic data generation.
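The selection rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `Strategy`, `budget_to_seed_ratio`, `choose_strategy`, and the `crossover_ratio` parameter are hypothetical, and the crossover value is something the caller would have to estimate for their own task, since the abstract gives no number. The sketch covers only the two regimes the abstract states explicitly (Answer Augmentation at a low ratio, New Question at a high ratio).

```python
from enum import Enum


class Strategy(Enum):
    """The two regimes the abstract names explicitly (hypothetical enum)."""
    ANSWER_AUGMENTATION = "answer_augmentation"  # new answers to existing questions
    NEW_QUESTION = "new_question"                # entirely new questions


def budget_to_seed_ratio(query_budget: int, seed_size: int) -> float:
    """Ratio of teacher query budget to seed instruction set size."""
    if seed_size <= 0:
        raise ValueError("seed_size must be positive")
    if query_budget < 0:
        raise ValueError("query_budget must be non-negative")
    return query_budget / seed_size


def choose_strategy(query_budget: int, seed_size: int,
                    crossover_ratio: float) -> Strategy:
    """Pick a strategy from the budget-to-seed ratio.

    crossover_ratio is an assumption supplied by the caller; it is NOT a
    value reported in the paper and would need to be estimated per task.
    """
    ratio = budget_to_seed_ratio(query_budget, seed_size)
    if ratio < crossover_ratio:
        return Strategy.ANSWER_AUGMENTATION
    return Strategy.NEW_QUESTION


if __name__ == "__main__":
    # Illustrative numbers only; the crossover of 5.0 is a placeholder.
    print(choose_strategy(query_budget=1_000, seed_size=1_000, crossover_ratio=5.0))
    print(choose_strategy(query_budget=50_000, seed_size=1_000, crossover_ratio=5.0))
```

Making `crossover_ratio` a required argument rather than a default keeps the sketch from implying a specific threshold; in practice it would presumably be read off a cost-effectiveness curve fitted for the task at hand.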

Paper Structure (25 sections, 5 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: Overview of Synthetic Data Generation Approaches. Given a seed instruction set, we have 3 different methods to create instruction-response pairs for fine-tuning our student model. We use an example seed instruction from the ARC-C training set with synthetic instructions and responses generated with Llama 3.1 70b Instruct.
  • Figure 2: Student model $\pi_S$ accuracy on GSM8k (Top), Spider (Middle) and ARC-C (Bottom) after fine-tuning on synthetic data from our teacher model $\pi_T$ and across resource constraints.
  • Figure 3: Cost-Effectiveness on GSM8k (Top), Spider (Middle), and ARC-C (Bottom): Across all three seed instruction sizes, the dashed line marks the query budget at which the optimal data generation strategy changes from generating new responses to generating new instructions. For details on how the regression curves were fitted using our scaling relationship model, please refer to Appendix \ref{appendix:scaling_relationship}.
  • Figure 4: Performance Trade-off with Weaker Augmentation Model $\pi_{aug}$ on GSM8k
  • Figure 5: Ablations measuring the effect of verification with 1,000 seed instructions from Spider. We ensure the synthetic data size is the amount after filtering.
  • ...and 2 more figures