Batch Aggregation: An Approach to Enhance Text Classification with Correlated Augmented Data

Charco Hui, Yalu Wen

TL;DR

This work addresses the challenge of degraded text classification performance when labeled data are scarce by treating augmented texts as correlated observations rather than independent samples. It introduces Batch Aggregation (BAGG), a pooling-based extension that aggregates augmented inputs at the per-original-input level, yielding a per-observation loss and potentially unbiased gradients. The method is instantiated with a BERT-based classifier and extended to combine multiple augmentation methods (EDA, back-translation via OPUS-MT and Google Translate), showing consistent accuracy gains across open-domain and domain-specific datasets, with larger improvements in low-data regimes. The findings suggest BAGG enhances robustness and reliability of text classification when training data are limited, particularly in specialized domains such as medicine and clinical trials, by properly accounting for correlations among augmented texts.

Abstract

Natural language processing models often face challenges due to limited labeled data, especially in domain-specific areas such as clinical trials. To overcome this, text augmentation techniques are commonly used to increase sample size by transforming the original input data into artificial ones with the label preserved. However, traditional text classification methods ignore the relationship between augmented texts and treat them as independent samples, which may introduce classification error. Therefore, we propose a novel approach called 'Batch Aggregation' (BAGG), which explicitly models the dependence of text inputs generated through augmentation by incorporating an additional layer that aggregates results from correlated texts. By studying multiple benchmark data sets across different domains, we found that BAGG can improve classification accuracy. We also found that the performance gain from BAGG is more pronounced in domain-specific data sets, with accuracy improvements of up to 10-29%. Through the analysis of benchmark data, the proposed method addresses limitations of traditional techniques and improves robustness in text classification tasks. Our results demonstrate that BAGG offers more robust results and outperforms traditional approaches when training data are limited.
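
To make the contrast between the two training views concrete, the following is a minimal sketch and not the authors' implementation. It assumes the aggregation layer is a simple mean pool over per-text classifier logits; the pooling operator and where it is applied are assumptions, and all function names (`mean_pool`, `standard_loss`, `aggregated_loss`) are hypothetical. The toy logits stand in for the outputs of a classifier such as BERT.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one vector of class logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # Negative log-probability assigned to the true class.
    return -math.log(softmax(logits)[label])

def mean_pool(group_logits):
    # ASSUMPTION: aggregate a group by averaging logits class by class.
    k = len(group_logits)
    return [sum(col) / k for col in zip(*group_logits)]

def standard_loss(groups, labels):
    # Baseline view: every original or augmented text is an independent
    # sample, so each one contributes its own loss term.
    losses = [cross_entropy(logits, y)
              for group, y in zip(groups, labels)
              for logits in group]
    return sum(losses) / len(losses)

def aggregated_loss(groups, labels):
    # Aggregated view: texts derived from the same original are pooled
    # first, so each original observation contributes one loss term.
    losses = [cross_entropy(mean_pool(group), y)
              for group, y in zip(groups, labels)]
    return sum(losses) / len(losses)

# Toy logits: each inner list is one text; each group holds an original
# text together with its augmented versions (hypothetical values).
groups = [
    [[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]],  # observation 0: 3 texts
    [[0.0, 3.0], [0.0, 1.0]],              # observation 1: 2 texts
]
labels = [0, 1]

print("standard loss:  ", standard_loss(groups, labels))
print("aggregated loss:", aggregated_loss(groups, labels))
```

In the baseline, observation 0 contributes three loss terms and observation 1 contributes two, so the number of augmentations changes each observation's weight. In the aggregated version each observation contributes exactly one term, which is the per-observation loss described in the TL;DR.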

Paper Structure

This paper contains 15 sections, 2 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Batch Aggregation (BAGG) with two inputs
  • Figure 2: Batch Aggregation with two augmentation methods
  • Figure 3: Average accuracies of Standard augmentation on the Amazon data set. "Combined" represents the combinations of EDA, Google and OPUS-MT.
  • Figure 4: Average accuracies of Standard augmentation on the Newsgroup data set.
  • Figure 5: Average accuracies of Standard augmentation on the LitCovid data set.
  • ...and 2 more figures