On Evaluation Protocols for Data Augmentation in a Limited Data Scenario

Frédéric Piedboeuf; Philippe Langlais

TL;DR

The paper shows that zero- and few-shot data augmentation (DA) via conversational agents such as ChatGPT or Llama 2 can improve performance, confirming that this form of data augmentation is preferable to classical methods.

Abstract

Textual data augmentation (DA) is a prolific field of study in which novel techniques for creating artificial data are regularly proposed, and it has been shown to be highly effective in small-data settings, at least for text classification tasks. In this paper, we challenge those results, showing that classical data augmentation (which modifies sentences) is simply a way of performing better fine-tuning, and that spending more time on fine-tuning before applying data augmentation negates its effect. This is a significant contribution, as it answers several questions left open in recent years, namely: which DA technique performs best (all of them, as long as they generate data close enough to the training set so as not to impair training), and why DA showed positive results (it facilitates training of the network). We further show that zero- and few-shot DA via conversational agents such as ChatGPT or Llama 2 can improve performance, confirming that this form of data augmentation is preferable to classical methods.

Paper Structure (20 sections, 4 figures, 8 tables)

This paper contains 20 sections, 4 figures, 8 tables.

Introduction
Related Work
Datasets
Data augmentation methods
Classical methods
Large Language Models
Baselines
Experimental setups
Better fine-tuning
More realistic uses of data
Experiments
On the need to better fine-tuning
On the inefficiency of classical DA
On more realistic uses of data
Analysis of DA with LLMs
...and 5 more sections

Figures (4)

Figure 1: Strategies tested in this paper. Green algorithms are context-based methods, yellow are paraphrasing methods, red are word-manipulation methods, and purple are methods using LLMs.
Figure 2: Graphic representation of the four settings we test for data augmentation in small-data learning. Blue represents the original training set, purple the validation set, and yellow the test set.
Figure 3: Percentage of times the row algorithm performs statistically better than the column algorithm, using a two-tailed paired t-test with a p-value threshold of 0.05, across the two small-data settings (10/20).
Figure 4: Percentage of times the row algorithm performs statistically better than the column algorithm, using a two-tailed paired t-test with a p-value threshold of 0.05, with starting sizes of 500/1000.
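
The pairwise comparison described in the captions of Figures 3 and 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data layout (`scores[condition][algorithm]` as a list of per-seed scores), the function names, and the toy numbers are all hypothetical. The p-value is computed with the standard library only, by numerically integrating the Student t density; in practice `scipy.stats.ttest_rel` would be the usual choice.

```python
import math
from statistics import mean, stdev


def t_pdf(x, df):
    # Density of Student's t distribution with df degrees of freedom.
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    # p = 1 - P(-|t| <= T <= |t|); the central mass is integrated with
    # Simpson's rule on [0, |t|] and doubled by symmetry (steps must be even).
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2 * total * h / 3)


def paired_t_test(a, b):
    # Paired t-test on matched runs (e.g. same seeds); returns (t, p).
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        # Degenerate case: every paired difference is identical.
        m = mean(diffs)
        return (0.0, 1.0) if m == 0 else (math.copysign(math.inf, m), 0.0)
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, two_tailed_p(t, len(diffs) - 1)


def win_matrix(scores, alpha=0.05):
    # Percentage of conditions in which the row algorithm is significantly
    # better than the column algorithm (p < alpha and positive mean difference).
    algos = sorted({a for cond in scores.values() for a in cond})
    wins = {r: {c: 0 for c in algos if c != r} for r in algos}
    for cond in scores.values():
        for r in algos:
            for c in algos:
                if r == c:
                    continue
                t, p = paired_t_test(cond[r], cond[c])
                if p < alpha and t > 0:
                    wins[r][c] += 1
    n = len(scores)
    return {r: {c: 100.0 * w / n for c, w in row.items()}
            for r, row in wins.items()}


# Hypothetical toy data: two conditions, five paired runs per algorithm.
toy_scores = {
    "cond_1": {"baseline": [0.70, 0.72, 0.69, 0.71, 0.73],
               "aug": [0.80, 0.83, 0.78, 0.82, 0.85]},
    "cond_2": {"baseline": [0.60, 0.65, 0.62, 0.64, 0.61],
               "aug": [0.61, 0.63, 0.64, 0.62, 0.62]},
}
matrix = win_matrix(toy_scores)
print(matrix)
```

In this toy example the augmented variant is counted as a win in only one of the two conditions, because the second condition shows no consistent paired difference. Requiring both a small p-value and a positive mean difference is one reasonable reading of "statistically better" under a two-tailed test; the paper may define it differently.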
