SynthBio: A Case Study in Human-AI Collaborative Curation of Text Datasets
Ann Yuan, Daphne Ippolito, Vitaly Nikolaev, Chris Callison-Burch, Andy Coenen, Sebastian Gehrmann
TL;DR
This work addresses the scarcity of high-quality NLP benchmarks, and the noise and bias in existing ones, by proposing a human-AI collaborative workflow for dataset curation. It presents SynthBio, a synthetic evaluation set for WikiBio built from language model seed generations that human raters then edit, yielding a dataset with less noise and more balanced demographics. Evaluations show that SynthBio achieves greater faithfulness and coverage than WikiBio while remaining fluent, and that models trained on WikiBio struggle more on SynthBio, highlighting the value of grounded evaluation. The study argues for controlled, human-guided synthetic data as a practical tool for robust benchmarking and broader distributional testing in NLG tasks.
Abstract
NLP researchers need more, higher-quality text datasets. Human-labeled datasets are expensive to collect, while datasets collected via automatic retrieval from the web, such as WikiBio, are noisy and can include undesired biases. Moreover, data sourced from the web is often included in the datasets used to pretrain models, leading to inadvertent cross-contamination of training and test sets. In this work we introduce a novel method for efficient dataset curation: we use a large language model to provide seed generations to human raters, thereby changing dataset authoring from a writing task to an editing task. We use our method to curate SynthBio, a new evaluation set for WikiBio composed of structured attribute lists describing fictional individuals, mapped to natural language biographies. We show that our dataset of fictional biographies is less noisy than WikiBio, and also more balanced with respect to gender and nationality.
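As a rough illustration of the seed-then-edit workflow described above, the following Python sketch pairs a structured attribute list with a biography and records whether a rater changed the seed text. It is not the authors' implementation: the `BioExample` record, the `curate` function, the toy generator and editor, and the made-up person are all assumptions introduced here for illustration. In practice the seed would come from a large language model and the edit from a human rater.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class BioExample:
    """Hypothetical record: an attribute list mapped to a biography."""
    attributes: Dict[str, str]  # structured attributes of a fictional person
    biography: str              # natural language text after editing
    edited: bool = False        # True if the rater changed the seed text


def curate(
    attributes: Dict[str, str],
    seed_generator: Callable[[Dict[str, str]], str],
    human_edit: Callable[[Dict[str, str], str], str],
) -> BioExample:
    """Get a seed draft, then pass it to an editor instead of writing from scratch."""
    seed = seed_generator(attributes)
    final = human_edit(attributes, seed)
    return BioExample(attributes, final, edited=(final != seed))


# Toy stand-ins (assumptions): a template plays the language model,
# and a simple rule plays the human rater.
def toy_seed_generator(attrs: Dict[str, str]) -> str:
    return f"{attrs['name']} was a {attrs['nationality']} {attrs['occupation']}."


def toy_human_edit(attrs: Dict[str, str], seed: str) -> str:
    # The "rater" adds the birth year when the seed draft omits it.
    year = attrs.get("birth_year")
    if year and year not in seed:
        return seed[:-1] + f" born in {year}."
    return seed


attrs = {
    "name": "Mara Ilves",  # made-up fictional individual
    "nationality": "Latvian",
    "occupation": "botanist",
    "birth_year": "1902",
}
example = curate(attrs, toy_seed_generator, toy_human_edit)
print(example.biography)
```

The `edited` flag is one simple way to track how often raters changed a seed, which is the kind of signal that distinguishes an editing task from a writing task.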
