Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification

Hsun-Yu Kuo; Yin-Hsiang Liao; Yu-Chieh Chao; Wei-Yun Ma; Pu-Jen Cheng

TL;DR

This work tackles the misalignment between synthetic data generated by large language models and real-world data distributions in text classification. It introduces two efficient weighted-loss strategies, IMP-Loss and DIMP-Loss, which use a small amount of real-world data as quality checkpoints and diversity signals to shape learning from abundant LLM-generated data. The methods offer a principled, computationally tractable way to shift the LLM data distribution toward the real-world distribution, and they yield robust improvements across multiple benchmarks and data regimes, including cases where the data generator alone falls short. In practice, the approach enables scalable, data-efficient use of synthetic data for NLP, with strong empirical gains and a favorable resource profile compared with meta-learning-based baselines.

Abstract

Synthetic data augmentation via large language models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data is scarce. However, the generated data can deviate from the real-world data, and this misalignment can degrade results when the trained model is applied in practice. We therefore propose efficient weighted-loss approaches that align synthetic data with the real-world distribution by emphasizing high-quality and diversified LLM-generated data, using only a small amount of real-world data. We empirically assess the effectiveness of our methods on multiple text classification tasks; the results show that applying our approaches to a BERT-level model robustly outperforms standard cross-entropy and other data weighting approaches, offering a potential solution for effectively leveraging synthetic data from any suitable data generator for model training.
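To make the idea of a weighted loss concrete, the following is a minimal sketch of a generic importance-weighted cross-entropy. It is not the paper's definition of IMP-Loss or DIMP-Loss, which this page does not state. The function names, the ratio-style weights, and the toy numbers are all assumptions chosen for illustration: each synthetic example is weighted by how much more probable its label is under a hypothetical model fitted on real data than under a hypothetical model fitted on synthetic data.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_cross_entropy(logits_batch, labels, weights):
    """Mean over the batch of (weight * cross-entropy) per example."""
    losses = []
    for logits, y, w in zip(logits_batch, labels, weights):
        p = softmax(logits)
        losses.append(-w * math.log(p[y]))
    return sum(losses) / len(losses)


def ratio_weights(p_real, p_synth, eps=1e-8):
    """Hypothetical weights (an assumption, not the paper's formula):
    label probability under a real-data model divided by the label
    probability under a synthetic-data model, guarded against zero."""
    return [pr / max(ps, eps) for pr, ps in zip(p_real, p_synth)]


# Toy batch of two synthetic examples with two classes each.
logits_batch = [[0.0, 0.0], [0.0, 0.0]]
labels = [0, 1]
# Assumed label probabilities from the two hypothetical models.
weights = ratio_weights(p_real=[0.9, 0.2], p_synth=[0.6, 0.8])
loss = weighted_cross_entropy(logits_batch, labels, weights)
```

With all weights set to 1 this reduces to standard cross-entropy, so the weighting acts only as a per-example rescaling: examples that look more like real data contribute more to the gradient, and examples that look unlike it contribute less.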

Paper Structure

Figures (7)

Theorems & Definitions (2)