Multi-Sample Prompting and Actor-Critic Prompt Optimization for Diverse Synthetic Data Generation
Abdelkarim El-Hajjami, Camille Salinesi
Abstract
High-quality labeled datasets are fundamental for training and evaluating machine learning models, yet domains such as healthcare and Requirements Engineering (RE) face persistent barriers due to data scarcity, privacy constraints, or proprietary restrictions. While Large Language Models (LLMs) offer a promising avenue for Synthetic Data Generation (SDG), LLM-generated data tends to be repetitive and low in diversity, reducing its effectiveness for downstream tasks. Two approaches show potential for addressing this limitation: (1) multi-sample prompting, which generates multiple samples per prompt to reduce repetition, and (2) Prompt with Actor-Critic Editing (PACE), which iteratively refines prompts to maximize diversity. We integrate both mechanisms into Synthline, a Feature Model-based configurable synthetic data generator, and assess their effects on diversity and downstream utility across four RE classification tasks. Multi-sample prompting consistently improves both diversity and utility, with F1-score gains of 6 to 43.8 percentage points. PACE-based prompt optimization consistently improves lexical diversity but produces task-dependent utility effects, revealing the risks of optimizing for diversity alone. Most notably, synthetic data can match or surpass human-authored data for tasks where real labeled data is limited, with improvements of up to 15.4 percentage points in F1-score.
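To make the multi-sample idea concrete, the following is a minimal sketch, not the Synthline implementation. It assumes a hypothetical prompt wording and uses distinct-n (the share of unique n-grams) as a stand-in lexical diversity measure; the abstract does not specify which metric is used. The function names `build_multi_sample_prompt` and `distinct_n` are illustrative only.

```python
from typing import List


def build_multi_sample_prompt(task: str, label: str, k: int) -> str:
    """Request k samples in one prompt (hypothetical wording, not from the paper)."""
    return (
        f"Generate {k} distinct {task} examples labeled '{label}'. "
        "Vary wording, topic, and structure. Return one example per line."
    )


def distinct_n(samples: List[str], n: int = 2) -> float:
    """Share of unique n-grams among all n-grams across the samples (0 to 1).

    Assumed stand-in for a lexical diversity metric; higher means less repetition.
    """
    ngrams = []
    for sample in samples:
        tokens = sample.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


if __name__ == "__main__":
    print(build_multi_sample_prompt("software requirement", "security", 5))
    repeated = ["the system shall log errors", "the system shall log errors"]
    varied = ["the system shall log errors", "users can reset their password"]
    print(distinct_n(repeated), distinct_n(varied))
```

Because all k samples are requested in a single prompt, the model can see its earlier outputs while producing later ones, which is the intuition behind reduced repetition; a score such as distinct-n gives one way to check whether that happens on a given batch.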
