Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science
Veniamin Veselovsky, Manoel Horta Ribeiro, Akhil Arora, Martin Josifoski, Ashton Anderson, Robert West
TL;DR
This study tackles the fidelity gap in synthetic data generated by large language models by evaluating three strategies (grounding, filtering, and taxonomy-based generation) on a sarcasm-detection task. Grounding emerges as the most effective approach for aligning synthetic data with real-world distributions, though all strategies have trade-offs. The work demonstrates that carefully designed prompting and evaluation pipelines can provide a cost-efficient, privacy-conscious alternative to human labeling and can serve as a stepping stone for training smaller models in computational social science. It offers concrete recommendations and identifies avenues for scaling up and extending the methodology to other tasks and larger models.
Abstract
Large Language Models (LLMs) have democratized synthetic data generation, which in turn has the potential to simplify and broaden a wide gamut of NLP tasks. Here, we tackle a pervasive problem in synthetic data generation: its generative distribution often differs from the distribution of real-world data researchers care about (in other words, it is unfaithful). In a case study on sarcasm detection, we study three strategies to increase the faithfulness of synthetic data: grounding, filtering, and taxonomy-based generation. We evaluate each strategy by training classifiers on the synthetic data it generates and measuring their performance on real-world data. While all three strategies improve the performance of classifiers, we find that grounding works best for the task at hand. As synthetic data generation plays an ever-increasing role in NLP research, we expect this work to be a stepping stone in improving its utility. We conclude this paper with some recommendations on how to generate high(er)-fidelity synthetic data for specific tasks.
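
The evaluation protocol described in the abstract (fit a classifier on synthetic examples only, then score it on real examples) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the bag-of-words naive Bayes classifier, the macro-F1 metric, the function names, and the toy sentences, which are invented placeholders rather than data from the study.

```python
from collections import Counter
import math


def tokenize(text):
    # Lowercased whitespace tokenization; a deliberately crude stand-in.
    return text.lower().split()


def train_naive_bayes(texts, labels):
    """Fit a multinomial naive Bayes model (counts only) on labeled texts."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the most probable label under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # tokens unseen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(gold) | set(predicted))
    f1s = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)


# Invented placeholder data: 1 = sarcastic, 0 = not sarcastic.
synthetic_texts = [
    "oh great another monday",
    "i love waiting in line",
    "the weather is nice today",
    "the meeting starts at noon",
]
synthetic_labels = [1, 1, 0, 0]
real_texts = ["oh great the train is late", "the train is late today"]
real_labels = [1, 0]

# Train on synthetic data only, then evaluate on real data only.
model = train_naive_bayes(synthetic_texts, synthetic_labels)
real_predictions = [predict(model, t) for t in real_texts]
score = macro_f1(real_labels, real_predictions)
```

Under this setup, a higher score on the real examples is read as a sign that the synthetic training set is more faithful, so repeating the procedure with synthetic sets from different generation strategies gives a way to compare them.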
