Generating Faithful Synthetic Data with Large Language Models: A Case Study in Computational Social Science

Veniamin Veselovsky, Manoel Horta Ribeiro, Akhil Arora, Martin Josifoski, Ashton Anderson, Robert West

TL;DR

This study tackles the fidelity gap in synthetic data generated by large language models by evaluating three prompting strategies—grounding, filtering, and taxonomy-based generation—on a sarcasm-detection task. Grounding emerges as the most effective approach for aligning synthetic data with real-world distributions, though all strategies have trade-offs. The work demonstrates that carefully designed prompting and evaluation pipelines can provide a cost-efficient, privacy-conscious alternative to human labeling and can serve as a stepping stone for training smaller models in computational social science. It offers concrete recommendations and identifies avenues for scaling up and extending the methodology to other tasks and larger models.

Abstract

Large Language Models (LLMs) have democratized synthetic data generation, which in turn has the potential to simplify and broaden a wide gamut of NLP tasks. Here, we tackle a pervasive problem in synthetic data generation: its generative distribution often differs from the distribution of real-world data researchers care about (in other words, it is unfaithful). In a case study on sarcasm detection, we study three strategies to increase the faithfulness of synthetic data: grounding, filtering, and taxonomy-based generation. We evaluate these strategies using the performance of classifiers trained with generated synthetic data on real-world data. While all three strategies improve the performance of classifiers, we find that grounding works best for the task at hand. As synthetic data generation plays an ever-increasing role in NLP research, we expect this work to be a stepping stone in improving its utility. We conclude this paper with some recommendations on how to generate high(er)-fidelity synthetic data for specific tasks.

Paper Structure (12 sections, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Depiction of the proposed strategies to increase the faithfulness of synthetically generated data. On the left-hand side, we depict different prompting strategies: asking an LLM to generate synthetic data with a simple prompt (Simple); grounding the synthetic data generation with real-world examples (Grounding-rewrite); and providing a taxonomy along with your prompt (Taxonomy). We also train a discriminator to distinguish between real and fake prompts and filter the data (as indicated by the dotted orange boxes on the right-hand side; Filtering).
  • Figure 2: Our prompting approach consists of four modular steps. (1) Initiate the model to generate an initial set of 10 data points. (2) Apply a grounding technique as the model generates these 10 data points. (3) Further augment the grounding process by providing the model with an initial taxonomy. (4) Lastly, the results from the grounding phase are filtered through a real-synthetic classifier to ensure their authenticity.
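
The four modular steps described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `build_prompt` and `filter_synthetic`, the prompt wording, the taxonomy labels, and the 0.5 threshold are all hypothetical, and the discriminator is represented by any callable that returns a probability that a text is real.

```python
from typing import Callable, List, Optional, Sequence


def build_prompt(
    n: int = 10,
    real_examples: Optional[Sequence[str]] = None,
    taxonomy: Optional[Sequence[str]] = None,
) -> str:
    """Compose a generation prompt from the modular steps (hypothetical wording)."""
    # Step 1 (Simple): ask for an initial set of n data points.
    parts: List[str] = [f"Generate {n} short sarcastic texts."]
    # Step 2 (Grounding): include real-world examples for the model to rewrite.
    if real_examples:
        parts.append("Base them on rewrites of these real examples:")
        parts.extend(f"- {example}" for example in real_examples)
    # Step 3 (Taxonomy): name the categories the output should cover.
    if taxonomy:
        parts.append("Cover these categories: " + ", ".join(taxonomy))
    return "\n".join(parts)


def filter_synthetic(
    candidates: Sequence[str],
    p_real: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[str]:
    """Step 4 (Filtering): keep texts the discriminator scores as real-looking."""
    return [text for text in candidates if p_real(text) >= threshold]


if __name__ == "__main__":
    prompt = build_prompt(
        real_examples=["Great, another Monday."],
        taxonomy=["irony", "overstatement"],  # placeholder labels
    )
    print(prompt)

    # Toy stand-in for a trained real-vs-synthetic classifier.
    def toy_scorer(text: str) -> float:
        return 0.9 if len(text) < 40 else 0.1

    generated = ["Oh sure, because that always works.", "x" * 60]
    print(filter_synthetic(generated, toy_scorer))
```

Keeping each step optional mirrors the modular framing of the caption: the same function yields the Simple prompt when called with no arguments, and the grounded or taxonomy-augmented variants when the corresponding inputs are supplied. In practice `p_real` would wrap a trained classifier rather than the toy length-based scorer used here.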