Text Grafting: Near-Distribution Weak Supervision for Minority Classes in Text Classification

Letian Peng, Yi Gu, Chengyu Dong, Zihan Wang, Jingbo Shang

TL;DR

This paper introduces Text Grafting, a hybrid framework for extremely weak-supervised text classification that targets minority classes by marrying text mining with data synthesis. By extracting high-potential components from raw text via LLM-based logits to form masks and templates, then filling these templates with a powerful LLM, the method generates near-distribution in-class texts that improve classifier performance. Across multiple datasets, Text Grafting outperforms pure mining and pure synthesis baselines, with ablations confirming the importance of each stage and analyses showing that grafted data lie near the distribution of the target domain. The approach reduces reliance on fully annotated data and negative sample synthesis, offers robustness to zero-occurrence scenarios, and provides practical guidance on hyperparameters and template design for minority-class learning.

Abstract

For extremely weak-supervised text classification, pioneering research generates pseudo labels by mining texts similar to the class names from the raw corpus, which may yield very few or even no samples for the minority classes. Recent works have started to generate the relevant texts by prompting LLMs with the class names or definitions; however, there is a high risk that LLMs cannot generate in-distribution data (i.e., data similar to the corpus where the text classifier will be applied), leading to classifiers that do not generalize. In this paper, we combine the advantages of these two approaches and propose to bridge the gap via a novel framework, text grafting, which aims to obtain clean and near-distribution weak supervision for minority classes. Specifically, we first use LLM-based logits to mine masked templates from the raw corpus that have a high potential for data synthesis into the target minority class. Then, the templates are filled by state-of-the-art LLMs to synthesize near-distribution texts falling into minority classes. Text grafting shows significant improvement over direct mining or synthesis on minority classes. We also use analysis and case studies to understand the properties of text grafting.
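
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that a per-token "potential" score toward the minority class has already been computed from LLM logits (the numbers below are made up), and the mask symbol, the `keep_ratio` parameter, the mean-of-kept-scores template score, the function names, and the prompt wording are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

MASK = "_"  # hypothetical mask symbol for the spans the LLM will rewrite


def make_template(tokens: List[str], scores: List[float],
                  keep_ratio: float = 0.25) -> Tuple[str, float]:
    """Keep the highest-scoring tokens and mask everything else.

    `scores` are assumed per-token potentials toward the target class.
    Returns the masked template and a template score, taken here as the
    mean score of the kept tokens (an assumption of this sketch).
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not tokens:
        return "", float("-inf")
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token positions by score, breaking ties by position.
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    keep = set(ranked[:n_keep])
    pieces: List[str] = []
    for i, tok in enumerate(tokens):
        if i in keep:
            pieces.append(tok)
        elif not pieces or pieces[-1] != MASK:
            pieces.append(MASK)  # merge consecutive masked tokens into one
    template_score = sum(scores[i] for i in keep) / n_keep
    return " ".join(pieces), template_score


def select_top_templates(corpus: List[Tuple[List[str], List[float]]],
                         k: int, keep_ratio: float = 0.25
                         ) -> List[Tuple[str, float]]:
    """Stage 1 (mining): score every raw text and keep the k best templates."""
    templates = [make_template(t, s, keep_ratio) for t, s in corpus]
    return sorted(templates, key=lambda pair: -pair[1])[:k]


def build_fill_prompt(template: str, class_name: str) -> str:
    """Stage 2 (synthesis): build a prompt asking an LLM to fill the masks."""
    return (f"Fill in the blanks ({MASK}) in the template so that the text "
            f"belongs to the class '{class_name}'.\nTemplate: {template}")


# Toy corpus with made-up scores; a real pipeline would derive them from an LLM.
corpus = [
    ("i did not expect the party at all".split(),
     [0.1, 0.2, 0.9, 1.5, 0.0, 0.3, 0.4, 0.8]),
    ("the weather is nice today".split(),
     [0.0, 0.1, 0.0, 0.2, 0.1]),
]
best = select_top_templates(corpus, k=1)
prompt = build_fill_prompt(best[0][0], "Surprised")
```

The selected template keeps the wording and style of a real corpus text while leaving room for class-specific content, which is the intuition behind calling the synthesized texts near-distribution. The prompt string would then be sent to whichever LLM is used for filling.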

Paper Structure (35 sections, 2 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: The framework of text grafting.
  • Figure 2: The precision of state-of-the-art text mining on the same classes with different class proportions. "Precision" refers to the precision of the pseudo-labels. "Class Proportion" is the ratio of texts of this class to the entire corpus after down-sampling.
  • Figure 3: The overview of text grafting with the minority class "Surprised" in the Emotion dataset as an example. Text grafting includes two stages: 1) Text (Template) Mining: Create scored templates and select the ones with the top scores. 2) Data Synthesis: Prompt the LLM to fill in the templates to synthesize in-class texts.
  • Figure 4: The visualization of text distributions from different methods.
  • Figure 5: The analysis on the necessity of negative data synthesis.
  • ...and 3 more figures