LLMs vs Established Text Augmentation Techniques for Classification: When do the Benefits Outweight the Costs?

Jan Cegin, Jakub Simko, Peter Brusilovsky

TL;DR

This study systematically compares LLM-based text augmentation with established methods for downstream text classification across six English datasets, three classifier architectures, and two fine-tuning strategies, totaling 267,300 fine-tunings. It finds that LLM-based paraphrasing can outperform established methods primarily when the seed count per label is very small (5–20), but benefits diminish as seeds increase and costs (time, money, and emissions) rise markedly. Established methods, especially contextual insertion, often match or exceed LLM-based gains in accuracy at a fraction of the cost, making them preferable in typical resource settings. The work provides practical guidance: reserve LLM-based augmentation for low-resource scenarios, and rely on cheaper established techniques for broader use, while acknowledging limitations and potential future extensions (prompt design, broader LLM coverage, multilingual scenarios).

Abstract

Generative large language models (LLMs) are increasingly being used for data augmentation tasks, where text samples are LLM-paraphrased and then used for classifier fine-tuning. However, research confirming a clear cost-benefit advantage of LLMs over more established augmentation methods is largely missing. To study whether (and when) LLM-based augmentation is advantageous, we compared the effects of recent LLM augmentation methods with established ones on 6 datasets, 3 classifiers and 2 fine-tuning methods. We also varied the number of seeds and collected samples to better explore the downstream model accuracy space. Finally, we performed a cost-benefit analysis and show that LLM-based methods are worthy of deployment only when a very small number of seeds is used. Moreover, in many cases, established methods lead to similar or better model accuracies.

Paper Structure (24 sections, 10 figures, 7 tables)

This paper contains 24 sections, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Overview of our methodology. For each dataset, we randomly sample 100 samples per label, which are then used to collect up to 15 augmented samples per seed sample. These seeds are then randomly subsampled at various sizes and used for fine-tuning with varying numbers of augmented samples to evaluate each method.
  • Figure 2: The difference in mean accuracy between classifiers trained with the paraphrasing augmentation method and those trained with the contextual insert augmentation method, for 6 different datasets. Paraphrasing generally works better for a small number (5-20) of seeds per label, and this benefit deteriorates as the number of seeds per label increases.
  • Figure 3: The number of cases per number of collected augmented samples per seed sample where each augmentation method achieved best accuracy for 6 different combinations of models and fine-tuning methods. Except for RoBERTa and DistilBERT full fine-tuning, the methods worked best for model accuracy when more augmented samples were provided.
  • Figure 4: The difference in mean accuracy for classifiers trained on the paraphrasing augmentation method and the backtranslation augmentation method for 6 different datasets. The paraphrasing method works generally better in all cases.
  • Figure 5: The difference in mean accuracy between classifiers trained with the paraphrasing augmentation method and those trained with the contextual swap augmentation method, for 6 different datasets. Paraphrasing generally works better in all cases, with a decreasing effect as the number of seeds per label increases.
  • ...and 5 more figures
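
The sampling protocol summarized in the Figure 1 caption (100 seeds per label, up to 15 augmented samples per seed, then subsampling seeds and augmentations at various sizes) might be sketched roughly as below. This is only an illustrative sketch, not the authors' code: the function names, the `toy_augment` stand-in, the toy dataset, and the choice to keep the seeds themselves in the training set are all assumptions made for the example.

```python
import random
from collections import defaultdict


def sample_seeds(dataset, per_label=100, seed=0):
    """Randomly pick up to `per_label` seed texts for every label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in dataset:
        by_label[label].append(text)
    return {
        label: rng.sample(texts, min(per_label, len(texts)))
        for label, texts in by_label.items()
    }


def collect_augmentations(seeds, augment, max_aug=15):
    """Collect up to `max_aug` augmented samples once for every seed text."""
    return {
        text: augment(text, max_aug)[:max_aug]
        for texts in seeds.values()
        for text in texts
    }


def build_training_set(seeds, pool, n_seeds, n_aug, seed=0):
    """Subsample `n_seeds` seeds per label and add `n_aug` augmentations per seed."""
    rng = random.Random(seed)
    train = []
    for label, texts in seeds.items():
        for text in rng.sample(texts, min(n_seeds, len(texts))):
            train.append((text, label))  # assumption: seeds stay in the training set
            train.extend((aug, label) for aug in pool[text][:n_aug])
    return train


def toy_augment(text, k):
    """Hypothetical stand-in for a real augmenter (LLM paraphrasing, contextual insertion, ...)."""
    return [f"{text} (variant {i})" for i in range(k)]


# Toy data (hypothetical): 2 labels with 120 unique texts each.
dataset = [(f"{label} text {i}", label) for label in ("pos", "neg") for i in range(120)]
seeds = sample_seeds(dataset)
pool = collect_augmentations(seeds, toy_augment)
train = build_training_set(seeds, pool, n_seeds=5, n_aug=3)
print(len(train))  # 2 labels * 5 seeds * (1 seed + 3 augmentations) = 40
```

Collecting the augmentation pool once and subsampling from it afterwards keeps the expensive augmentation step out of the loop over seed counts and augmentation counts, so each configuration only requires a new fine-tuning run.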