Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification
Hsun-Yu Kuo, Yin-Hsiang Liao, Yu-Chieh Chao, Wei-Yun Ma, Pu-Jen Cheng
TL;DR
This work tackles the problem of misalignment between synthetic data from large language models and real-world data distributions in text classification. It introduces two efficient weighted-loss strategies, IMP-Loss and DIMP-Loss, which leverage small real-world data as quality checkpoints and diversity signals to shape learning from abundant LLM-generated data. The methods provide principled, computationally tractable means to transform the LLM data distribution toward the real-world distribution, demonstrated to yield robust improvements across multiple benchmarks and data regimes, including cases where the data generator alone falls short. Practically, the approach enables scalable, data-efficient utilization of synthetic data for NLP, with strong empirical gains and favorable resource profiles compared to meta-learning-based baselines.
Abstract
Synthetic data augmentation via large language models (LLMs) gives researchers additional training data and can improve downstream task performance, especially when real-world data is scarce. However, the generated data can deviate from real-world data, and this misalignment can degrade results when the trained model is applied in practice. We therefore propose efficient weighted-loss approaches that align synthetic data with the real-world distribution by emphasizing high-quality and diverse LLM-generated data, using only a small amount of real-world data. We empirically assess our methods on multiple text classification tasks. The results show that a BERT-level model trained with our approaches robustly outperforms standard cross-entropy and other data weighting approaches, offering a potential way to make effective use of synthetic data from any suitable data generator for model training.
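To make the idea of a weighted loss concrete, the following is a minimal sketch of a per-example weighted cross-entropy. It is an illustration under stated assumptions, not the exact form of IMP-Loss or DIMP-Loss, which this summary does not spell out. The helper names (`weighted_cross_entropy`, `ratio_weights`) and the importance-style ratio used as a weight are hypothetical choices made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits_batch, labels, weights):
    """Mean over the batch of (weight * cross-entropy) per example.

    With all weights equal to 1 this reduces to standard cross-entropy.
    """
    if not (len(logits_batch) == len(labels) == len(weights)):
        raise ValueError("logits_batch, labels and weights must have equal length")
    total = 0.0
    for logits, y, w in zip(logits_batch, labels, weights):
        probs = softmax(logits)
        total += w * -math.log(probs[y])
    return total / len(labels)


def ratio_weights(real_probs, synth_probs, eps=1e-8):
    """Hypothetical weighting (an assumption for illustration only).

    real_probs[i]:  probability that a model fit on the small real-world set
                    assigns to the label of synthetic example i.
    synth_probs[i]: probability that a model fit on the synthetic data
                    assigns to that same label.
    Examples that look plausible under the real-data model but are
    over-represented in the synthetic data receive smaller weights.
    """
    return [p / max(q, eps) for p, q in zip(real_probs, synth_probs)]


# Two synthetic examples, two classes, uninformative logits.
logits_batch = [[0.0, 0.0], [0.0, 0.0]]
labels = [0, 1]

uniform_loss = weighted_cross_entropy(logits_batch, labels, [1.0, 1.0])
weights = ratio_weights([0.9, 0.2], [0.45, 0.8])
reweighted_loss = weighted_cross_entropy(logits_batch, labels, weights)
```

In this sketch the weights are computed outside the loss and treated as constants, so the training loop differs from standard cross-entropy only in the per-example scaling. How the weights are actually estimated from the small real-world set is the substance of the paper's methods and is not reproduced here.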
