Genie: Achieving Human Parity in Content-Grounded Datasets Generation
Asaf Yehudai, Boaz Carmeli, Yosi Mass, Ofir Arviv, Nathaniel Mills, Assaf Toledo, Eyal Shnarch, Leshem Choshen
TL;DR
Genie tackles the data bottleneck in content-grounded generation with a three-stage pipeline (Content Preparation, Generation, and Filtering) that automatically curates high-quality synthetic data. The method extracts grounding content, generates task-specific examples via few-shot prompting, and filters outputs for format, faithfulness, and overall quality, yielding datasets such as Wish-QA-NQ, Wish-QA-ELI5/ASQA, and Wish-QA. Empirically, models trained on Genie data are on par with or surpass models trained on human-written data, with especially strong gains in faithfulness and lexical diversity, and the method also benefits domain adaptation to medical LFQA. Overall, Genie offers a scalable, cost-efficient path to high-quality content-grounded datasets across tasks and domains, with publicly released data and practical relevance for RAG, QA, and summarization systems.
Abstract
The lack of high-quality data for content-grounded generation tasks has been identified as a major obstacle to advancing these tasks. To address this gap, we propose Genie, a novel method for automatically generating high-quality content-grounded data. It consists of three stages: (a) Content Preparation; (b) Generation, which creates task-specific examples from the content (e.g., question-answer pairs or summaries); and (c) Filtering, which aims to ensure the quality and faithfulness of the generated data. We showcase this methodology by generating three large-scale synthetic datasets, making wishes, for Long-Form Question-Answering (LFQA), summarization, and information extraction. In a human evaluation, our generated data was found to be natural and of high quality. Furthermore, we compare models trained on our data with models trained on human-written data: ELI5 and ASQA for LFQA, and CNN-DailyMail for summarization. We show that our models are on par with or outperform models trained on human-generated data, and consistently outperform them in faithfulness. Finally, we apply our method to create LFQA data within the medical domain and compare a model trained on it with models trained on other domains.
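To make the three stages concrete, the following is a minimal sketch of how such a pipeline could be wired together. It is not the authors' implementation: every name in it (`Example`, `prepare_content`, `passes_format`, `faithfulness_proxy`, `run_pipeline`, `toy_generate`) and every threshold is hypothetical. The generation step is a pluggable callable standing in for a few-shot-prompted language model, and the faithfulness check is a simple token-overlap proxy used here only for illustration, not the filter described in the paper.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Example:
    content: str   # grounding passage
    question: str  # generated question
    answer: str    # generated answer, expected to be supported by `content`


def prepare_content(raw_document: str, min_words: int = 5) -> List[str]:
    """Stage (a): split a raw document into passages and drop very short ones."""
    passages = [p.strip() for p in raw_document.split("\n\n")]
    return [p for p in passages if len(p.split()) >= min_words]


def _tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def passes_format(ex: Example) -> bool:
    """Format filter: non-empty answer and a question ending in '?'."""
    return bool(ex.answer.strip()) and ex.question.strip().endswith("?")


def faithfulness_proxy(ex: Example) -> float:
    """Assumed stand-in: fraction of answer tokens that also occur in the content."""
    answer_tokens = _tokens(ex.answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & _tokens(ex.content)) / len(answer_tokens)


def run_pipeline(
    raw_document: str,
    generate: Callable[[str], Optional[Example]],
    min_faithfulness: float = 0.8,  # hypothetical threshold
) -> List[Example]:
    """Run preparation, generation (stage b), and filtering (stage c)."""
    kept = []
    for passage in prepare_content(raw_document):
        ex = generate(passage)  # in practice, a few-shot-prompted model call
        if ex is None or not passes_format(ex):
            continue
        if faithfulness_proxy(ex) < min_faithfulness:
            continue
        kept.append(ex)
    return kept


def toy_generate(passage: str) -> Example:
    """Toy generator for demonstration: answers with the passage's first sentence."""
    first_sentence = passage.split(".")[0].strip()
    return Example(passage, "What does the passage say?", first_sentence + ".")
```

Calling `run_pipeline(document_text, toy_generate)` returns only the examples that survive both filters. Passing the generator as a callable keeps the sketch independent of any particular model or prompting library, and a learned faithfulness or quality scorer could replace `faithfulness_proxy` without changing the surrounding structure.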
