BoostAug: Boosting Text Augmentation via Hybrid Instance Filtering Framework

Heng Yang, Ke Li

TL;DR

This work addresses a common problem: text augmentation on large public datasets often shifts the feature space and degrades performance. It introduces BoostAug, a two-phase framework in which a DeBERTa-based surrogate model, trained via $k$-fold cross-boosting, filters the augmentation instances produced by various backends so that they stay aligned with natural data. The approach combines perplexity filtering, confidence ranking, and predicted-label constraints, and uses a feature-space-shift metric based on convex hull overlap and distribution skewness to diagnose misalignment. Empirical results on text classification (TC), aspect-based sentiment classification (ABSC), and natural language inference (NLI) tasks show consistent improvements over baseline augmentations. Ablations highlight the importance of the cross-boosting and filtering components, and the authors release code to facilitate adoption on large datasets.
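The three filtering criteria named above can be illustrated with a minimal sketch. It assumes each augmented candidate already carries a perplexity, a predicted label, and a confidence score from the surrogate model. The `Candidate` type, the `filter_candidates` function, and the `max_perplexity`, `min_confidence`, and `top_n` parameters are hypothetical names chosen for illustration, not the authors' API or their actual thresholds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One augmented instance with scores from a surrogate model (hypothetical schema)."""
    text: str
    source_label: str     # label of the natural instance it was derived from
    predicted_label: str  # label assigned by the surrogate model
    confidence: float     # surrogate probability for predicted_label
    perplexity: float     # surrogate language-model perplexity


def filter_candidates(
    candidates: List[Candidate],
    max_perplexity: float,  # assumed threshold, not a value from the paper
    min_confidence: float,  # assumed threshold, not a value from the paper
    top_n: int,
) -> List[Candidate]:
    """Keep fluent, label-consistent, confident candidates, best first."""
    kept = [
        c for c in candidates
        if c.perplexity <= max_perplexity          # perplexity filtering
        and c.predicted_label == c.source_label    # predicted-label constraint
        and c.confidence >= min_confidence         # confidence threshold
    ]
    # Confidence ranking: most confident candidates first.
    kept.sort(key=lambda c: c.confidence, reverse=True)
    return kept[:top_n]
```

Applying the cheap checks before the ranking step keeps the sort limited to candidates that already satisfy every constraint.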

Abstract

Text augmentation is an effective technique for addressing the problem of insufficient data in natural language processing. However, existing text augmentation methods tend to focus on few-shot scenarios and usually perform poorly on large public datasets. Our research indicates that existing augmentation methods often generate instances with shifted feature spaces, which leads to a drop in performance on the augmented data (for example, EDA generally loses $\approx 2\%$ in aspect-based sentiment classification). To address this problem, we propose a hybrid instance-filtering framework (BoostAug) based on pre-trained language models that maintains a feature space similar to that of natural datasets. BoostAug is transferable to existing text augmentation methods (such as synonym substitution and back translation) and significantly improves augmentation performance, by $\approx 2$ to $3\%$ in classification accuracy. Our experimental results on three classification tasks and nine public datasets show that BoostAug addresses the performance drop problem and outperforms state-of-the-art text augmentation methods. Additionally, we release the code to help improve existing augmentation methods on large datasets.

Paper Structure (32 sections, 8 equations, 7 figures, 6 tables, 1 algorithm)

Figures (7)

  • Figure 1: Visualization of the feature space shift on the Laptop14 dataset based on $t$-SNE. We calculate the shift metric $\mathcal{S}$ of the feature space between augmented and natural instances. The augmentation methods are BoostAug, MonoAug, and EDA, respectively. BoostAug has the smallest feature space shift.
  • Figure 2: The workflow of BoostAug can be divided into two phases: Phase #$1$ and Phase #$2$. In Phase #$1$, we fine-tune a DeBERTa-based classification model using re-split training and validation sets and extract the fine-tuned DeBERTa to build a surrogate language model. In Phase #$2$, BoostAug employs a text augmentation backend to generate raw augmentations and filters out low-quality instances identified by the surrogate language model. To avoid data overlap between the training folds and the validation fold, BoostAug performs $k$-fold cross-boosting, meaning that Phase #$1$ and Phase #$2$ are repeated $k$ times.
  • Figure 3: Trajectories of the accuracy (Acc) and F1 values, with error bars, versus the number of augmentation instances generated per example using BoostAug (EDA). The trajectory plots for MonoAug and EDA can be found in fig:rq4_full.
  • Figure 4: Scott-Knott rank test plots under different $\alpha$ and $\beta$ in BoostAug (EDA). A higher rank indicates better performance.
  • Figure 5: The performance box plots under different $\alpha$ and $\beta$ in BoostAug (EDA).
  • ...and 2 more figures
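The $k$-fold re-splitting described in the Figure 2 caption can be sketched as follows. This is a minimal illustration of repeating the two phases $k$ times over disjoint training and validation folds, not the authors' implementation. The names `k_fold_splits`, `cross_boost`, and the `run_round` callback (which stands in for fine-tuning the surrogate model and then augmenting and filtering) are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_fold_splits(data: Sequence[T], k: int) -> List[Tuple[List[T], List[T]]]:
    """Return k (train, valid) pairs; each fold serves as validation exactly once."""
    if not 2 <= k <= len(data):
        raise ValueError("k must be between 2 and len(data)")
    # Assign every k-th item to the same fold so fold sizes differ by at most one.
    folds = [list(data[i::k]) for i in range(k)]
    splits = []
    for i in range(k):
        valid = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        splits.append((train, valid))
    return splits


def cross_boost(
    data: Sequence[T],
    k: int,
    run_round: Callable[[List[T], List[T]], List[T]],
) -> List[T]:
    """Repeat one round (hypothetical Phase 1 + Phase 2 callback) per split."""
    augmented: List[T] = []
    for train, valid in k_fold_splits(data, k):
        # run_round is assumed to fine-tune a surrogate on (train, valid)
        # and return the augmentation instances that survive filtering.
        augmented.extend(run_round(train, valid))
    return augmented
```

Because the training and validation folds are disjoint in every round, a surrogate model is never validated on data it was fine-tuned on in that round.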