Not All LLM-Generated Data Are Equal: Rethinking Data Weighting in Text Classification

Hsun-Yu Kuo, Yin-Hsiang Liao, Yu-Chieh Chao, Wei-Yun Ma, Pu-Jen Cheng

TL;DR

This work tackles the problem of misalignment between synthetic data from large language models and real-world data distributions in text classification. It introduces two efficient weighted-loss strategies, IMP-Loss and DIMP-Loss, which leverage small real-world data as quality checkpoints and diversity signals to shape learning from abundant LLM-generated data. The methods provide principled, computationally tractable means to transform the LLM data distribution toward the real-world distribution, demonstrated to yield robust improvements across multiple benchmarks and data regimes, including cases where the data generator alone falls short. Practically, the approach enables scalable, data-efficient utilization of synthetic data for NLP, with strong empirical gains and favorable resource profiles compared to meta-learning-based baselines.

Abstract

Synthetic data augmentation via large language models (LLMs) allows researchers to leverage additional training data, thus enhancing the performance of downstream tasks, especially when real-world data is scarce. However, the generated data can deviate from the real-world data, and this misalignment can degrade results when the trained model is applied in practice. We therefore propose efficient weighted-loss approaches that align synthetic data with the real-world distribution by emphasizing high-quality and diverse LLM-generated data, using only a small amount of real-world data. We empirically assessed our methods on multiple text classification tasks; the results show that applying them to a BERT-level model robustly outperformed standard cross-entropy and other data weighting approaches, offering a potential way to effectively leverage synthetic data from any suitable data generator for model training.
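
A minimal Python sketch of what a per-example weighted loss can look like. The abstract does not state the weight formulas, so the ratio used here (a quality checker's probability of the label divided by a diversity checker's probability of the same label) is an illustrative assumption, not the paper's definition of IMP-Loss or DIMP-Loss. All function names are hypothetical.

```python
import math


def label_probs(prob_rows, labels):
    """Pick, for each example, the probability assigned to its label."""
    return [row[y] for row, y in zip(prob_rows, labels)]


def ratio_weights(quality_rows, diversity_rows, labels, eps=1e-12):
    """Assumed weighting: quality-checker probability over diversity-checker probability.

    quality_rows: class probabilities from a model fit on the small real-world set.
    diversity_rows: class probabilities from a model fit on the LLM-generated set.
    """
    q = label_probs(quality_rows, labels)
    d = label_probs(diversity_rows, labels)
    return [qi / max(di, eps) for qi, di in zip(q, d)]


def weighted_cross_entropy(model_rows, labels, weights, eps=1e-12):
    """Mean over examples of weight * negative log-likelihood of the label."""
    p = label_probs(model_rows, labels)
    losses = [-w * math.log(max(pi, eps)) for w, pi in zip(weights, p)]
    return sum(losses) / len(losses)


# Two synthetic examples with binary labels (made-up probabilities).
labels = [0, 1]
quality = [[0.8, 0.2], [0.5, 0.5]]
diversity = [[0.4, 0.6], [0.5, 0.5]]
model = [[0.9, 0.1], [0.2, 0.8]]

weights = ratio_weights(quality, diversity, labels)
print("weights:", weights)
print("weighted loss:", weighted_cross_entropy(model, labels, weights))
```

With all weights set to 1, `weighted_cross_entropy` reduces to the standard cross-entropy that the abstract uses as a baseline.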

Paper Structure

This paper contains 56 sections, 1 theorem, 27 equations, 7 figures, 6 tables, 2 algorithms.

Key Result

Theorem B.1

Let $X$ be a random variable with finite expected value $\mathbb{E}[X]$ and variance $\text{Var}(X)$. For any $\epsilon > 0$, $P\left(|X - \mathbb{E}[X]| \geq \epsilon\right) \leq \frac{\text{Var}(X)}{\epsilon^2}$.
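
Theorem B.1 is listed as Chebyshev's inequality, $P(|X - \mathbb{E}[X]| \geq \epsilon) \leq \text{Var}(X)/\epsilon^2$. As a hedged illustration only, the sketch below checks the bound numerically on simulated Gaussian samples. The helper name `chebyshev_check`, the sample size, and the choice of distribution are assumptions made for this example and do not come from the paper.

```python
import random


def chebyshev_check(samples, eps):
    """Return (empirical tail frequency, Chebyshev bound) for the given samples.

    Uses the sample mean and the population variance (dividing by n), so the
    bound holds exactly for the empirical distribution of the samples.
    """
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    tail = sum(1 for x in samples if abs(x - mean) >= eps) / n
    bound = var / eps ** 2
    return tail, bound


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.gauss(0.0, 1.0) for _ in range(10000)]

for eps in (0.5, 1.0, 2.0, 3.0):
    tail, bound = chebyshev_check(samples, eps)
    print(f"eps={eps}: tail frequency={tail:.4f}, bound={bound:.4f}")
```

For small $\epsilon$ the bound exceeds 1 and is uninformative; it becomes useful only as $\epsilon$ grows relative to the standard deviation.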

Figures (7)

  • Figure 1: Training dynamics: testing accuracy over five epochs on the benchmarks. This chart displays the minimum, maximum, and average accuracy observed across four runs with different random seeds, comparing our proposed methods with the standard CE-Loss and Focal-Loss.
  • Figure 2: Test accuracy on the Financial benchmark with varying percentages of the training set used for the quality checker. The graph shows the performance of each loss and of the Quality Checker.
  • Figure 3: Total running time (in seconds) for CE-Loss, IMP-Loss, and DIMP-Loss on the LLM-generated Financial benchmark.
  • Figure 4: Average Quality Checker Score, Diversity Checker Score, and Weights of IMP-Loss for Financial Dataset: Comparison between Original, Swapped Label, Duplicated Data and Unrelated Input Data.
  • Figure 5: Average Quality Checker Score, Diversity Checker Score, and Weights of IMP-Loss for Tweet Irony Dataset: Comparison between Original, Swapped Label, and Duplicated Data.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Definition B.1: Convergence in Probability
  • Theorem B.1: Chebyshev's Inequality