Text Grafting: Near-Distribution Weak Supervision for Minority Classes in Text Classification
Letian Peng, Yi Gu, Chengyu Dong, Zihan Wang, Jingbo Shang
TL;DR
This paper introduces Text Grafting, a hybrid framework for extremely weak-supervised text classification that targets minority classes by combining text mining with data synthesis. The method uses LLM-based logits to identify high-potential components of raw text, masks the remaining tokens to form templates, and then fills those templates with a powerful LLM to generate near-distribution in-class texts that improve classifier performance. Across multiple datasets, Text Grafting outperforms both pure mining and pure synthesis baselines; ablations confirm the importance of each stage, and analyses show that grafted data lie near the distribution of the target domain. The approach reduces reliance on fully annotated data and on negative sample synthesis, remains robust when the minority class has zero occurrences in the raw corpus, and comes with practical guidance on hyperparameters and template design for minority-class learning.
Abstract
For extremely weak-supervised text classification, pioneering research generates pseudo labels by mining texts similar to the class names from the raw corpus, which may yield very few or even no samples for the minority classes. Recent works have started to generate the relevant texts by prompting LLMs with the class names or definitions; however, there is a high risk that LLMs cannot generate in-distribution data (i.e., data similar to the corpus on which the text classifier will be applied), leading to classifiers that do not generalize. In this paper, we combine the advantages of these two approaches and propose to bridge the gap via a novel framework, text grafting, which aims to obtain clean and near-distribution weak supervision for minority classes. Specifically, we first use LLM-based logits to mine masked templates from the raw corpus that have a high potential for being synthesized into texts of the target minority class. Then, the templates are filled by state-of-the-art LLMs to synthesize near-distribution texts that fall into the minority classes. Text grafting shows significant improvement over direct mining or synthesis on minority classes. We also use analyses and case studies to understand the properties of text grafting.
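The template-mining step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `make_template`, `mine_templates`, `score_fn`, `keep_ratio`, and `top_k` are hypothetical, and the scoring rule is a placeholder. In the paper the per-token scores come from LLM logits; here `score_fn` is any callable returning one score per token. The sketch keeps the highest-scoring tokens, collapses the rest into blanks, and ranks templates by the mean score of the kept tokens (an assumed ranking rule). The filling step, in which an LLM completes the blanks, is omitted.

```python
from typing import Callable, List, Tuple

MASK = "_"  # placeholder symbol for masked spans (an assumption, not from the paper)


def make_template(tokens: List[str], scores: List[float],
                  keep_ratio: float = 0.5) -> Tuple[str, float]:
    """Keep the highest-scoring tokens and collapse the rest into blanks."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Indices of the n_keep highest-scoring tokens (ties go to earlier tokens).
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:n_keep])
    out: List[str] = []
    for i, tok in enumerate(tokens):
        if i in kept:
            out.append(tok)
        elif not out or out[-1] != MASK:
            out.append(MASK)  # merge consecutive masked tokens into one blank
    # Assumed template-level potential: mean score of the kept tokens.
    potential = sum(scores[i] for i in kept) / n_keep
    return " ".join(out), potential


def mine_templates(corpus: List[str],
                   score_fn: Callable[[List[str]], List[float]],
                   keep_ratio: float = 0.5,
                   top_k: int = 2) -> List[Tuple[str, float]]:
    """Build one template per raw text and return the top_k by potential."""
    templates = []
    for text in corpus:
        tokens = text.split()  # whitespace tokenization, for illustration only
        if tokens:
            templates.append(make_template(tokens, score_fn(tokens), keep_ratio))
    templates.sort(key=lambda pair: pair[1], reverse=True)
    return templates[:top_k]


# Toy stand-in for LLM-derived scores: 1.0 for hand-picked "sports" words.
def toy_score(tokens: List[str]) -> List[float]:
    return [1.0 if t in {"match", "goal"} else 0.0 for t in tokens]


corpus = ["the match ended with a late goal", "stocks fell on monday"]
best = mine_templates(corpus, toy_score, keep_ratio=0.5, top_k=1)
```

In a full pipeline, each selected template would then be passed to an LLM with an instruction to fill the blanks so that the resulting text belongs to the target minority class; the kept tokens anchor the output near the raw corpus distribution.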
