TARDiS : Text Augmentation for Refining Diversity and Separability

Kyungmin Kim, SangHun Im, GiBaeg Kim, Heung-Seon Oh

TL;DR

TARDiS tackles limitations in two-stage LLM-based text augmentation for few-shot classification by introducing SEG and CEG, which generate diverse and discriminative samples through class-specific prompts, and CA, which verifies and modifies samples to align them with their target classes. Empirical results on BANKING77, CLINC150, HWU64, and TREC6 show state-of-the-art performance with a relatively small LLM, and ablations demonstrate the complementary benefits of SEG, CEG, and CA. The method preserves overall class distributions while increasing within-class diversity and between-class separability, addressing misalignment, out-of-distribution (OOD), and false-negative (FN) issues in the few-shot regime. The work offers practical gains for few-shot text classification and lays groundwork for extending TA to more complex tasks and more efficient pipelines.

Abstract

Text augmentation (TA) is a critical technique for text classification, especially in few-shot settings. This paper introduces a novel LLM-based TA method, TARDiS, to address challenges inherent in the generation and alignment stages of two-stage TA methods. For the generation stage, we propose two generation processes, SEG and CEG, incorporating multiple class-specific prompts to enhance diversity and separability. For the alignment stage, we introduce a class adaptation (CA) method to ensure that generated examples align with their target classes through verification and modification. Experimental results demonstrate TARDiS's effectiveness, outperforming state-of-the-art LLM-based TA methods in various few-shot text classification tasks. An in-depth analysis confirms the detailed behaviors at each stage.
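
The two stages described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `llm` callable, the prompt wording, the function names, and the `ambiguous_of` mapping are all hypothetical, and how the ambiguous class is selected is not specified here, so it is passed in as an input.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function that maps a prompt to generated text.
LLM = Callable[[str], str]


def seg(llm: LLM, target: str, seeds: List[str]) -> str:
    """Generation, SEG-style: first produce a contextualizing text from the
    seed data, then use it to generate a target-class example."""
    context = llm(f"CONTEXT: describe a situation for class '{target}' given {seeds}")
    return llm(f"EXAMPLE: write one '{target}' example for this situation: {context}")


def ceg(llm: LLM, target: str, seeds: List[str],
        ambiguous: str, ambiguous_seeds: List[str]) -> str:
    """Generation, CEG-style: use seeds from the target class and from an
    ambiguous class to generate a discriminative target-class example."""
    return llm(
        f"CONTRAST: '{target}' examples {seeds} vs '{ambiguous}' examples "
        f"{ambiguous_seeds}. Write one example that is clearly '{target}'."
    )


def class_adaptation(llm: LLM, example: str, target: str, labels: List[str]) -> str:
    """Alignment, CA-style: verify the example with the LLM as a classifier,
    and modify it only if the predicted label is not the target class."""
    predicted = llm(f"VERIFY: which of {labels} fits this text? {example}").strip()
    if predicted == target:
        return example
    return llm(f"MODIFY: rewrite this text so it belongs to '{target}': {example}")


def augment(llm: LLM, seeds_by_class: Dict[str, List[str]],
            ambiguous_of: Dict[str, str]) -> Dict[str, List[str]]:
    """Run generation (SEG and CEG) and then alignment (CA) for every class."""
    labels = list(seeds_by_class)
    augmented: Dict[str, List[str]] = {}
    for target, seeds in seeds_by_class.items():
        other = ambiguous_of[target]  # assumed to be chosen beforehand
        candidates = [
            seg(llm, target, seeds),
            ceg(llm, target, seeds, other, seeds_by_class[other]),
        ]
        augmented[target] = [
            class_adaptation(llm, c, target, labels) for c in candidates
        ]
    return augmented
```

Keeping the LLM behind a plain callable makes the sketch easy to exercise with a stub before wiring in a real model, and it keeps the generation and alignment stages separately testable.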

Paper Structure

This paper contains 18 sections, 1 equation, 4 figures, 10 tables.

Figures (4)

  • Figure 1: A rectangle denotes the holistic data distribution for the Transport_taxi (top left) and Transport_ticket (bottom right) classes, while circles and triangles denote seed data and augmented examples, respectively. (a) Augmented examples from previous research (lin2023selective; sahu2022data), which generates examples similar to the seed data using a single fixed prompt. (b) Augmented examples from TARDiS. SEG and CEG generate varied examples that enhance diversity and separability through multiple prompts built with spark thoughts. CA aligns misaligned examples with the target classes through verification and modification.
  • Figure 2: TARDiS framework. SEG generates a contextualizing text from seed data, then generates a target-class example using that contextualizing text. CEG generates a discriminative text using seed data from the target class and a selected ambiguous class. CA verifies generated examples and modifies them to align with the target class.
  • Figure 3: Class-wise performance comparison of SEG and CEG using RoBERTa-large in 5-shot settings on BANKING77. Results are sorted by the difference between SEG and CEG. The red line denotes the average accuracy.
  • Figure 4: Proportional confusion matrices from the results of a verifier (i.e., an LLM classifier) in 5-shot settings on BANKING77, where the top and bottom graphs show the results before and after CA, respectively. The striped areas denote the proportions of misaligned (TN and FP) or potentially misaligned (FN) examples. Ground-truth labels, which serve as the basis for evaluating the LLM classifier, were obtained from a model trained on the entire dataset.