Simple-Sampling and Hard-Mixup with Prototypes to Rebalance Contrastive Learning for Text Classification
Mengyu Li, Yonghao Liu, Fausto Giunchiglia, Ximing Li, Xiaoyue Feng, Renchu Guan
TL;DR
SharpReCL addresses imbalanced text classification by tightly coupling supervised contrastive learning with a balanced classification branch through class prototypes. It introduces simple-sampling and hard-mixup to generate a balanced, hard-sample-rich contrastive dataset, ensuring minority classes appear in every batch and providing stronger gradients. The approach yields strong empirical results across multiple imbalanced datasets, often outperforming baselines and, on some benchmarks, large language models, while offering clear ablations and mathematical insights into gradient behavior. This work provides a practical, prototype-guided framework to improve robustness and generalization in imbalanced text classification tasks.
Abstract
Text classification is a crucial and fundamental task in web content mining. Compared with the previous learning paradigm of pre-training followed by fine-tuning with a cross-entropy loss, the recently proposed supervised contrastive learning approach has received tremendous attention due to its powerful feature learning capability and robustness. Although several studies have applied this technique to text classification, some limitations remain. First, many text datasets are imbalanced, and the learning mechanism of supervised contrastive learning is sensitive to data imbalance, which may harm the model's performance. Second, these models use a cross-entropy classification branch and a supervised contrastive learning branch that are kept separate, without explicit mutual guidance. To address these limitations, we propose a novel model named SharpReCL for imbalanced text classification tasks. First, we obtain the prototype vector of each class in the balanced classification branch to act as a representation of that class. Then, by explicitly leveraging the prototype vectors, we construct a proper and sufficient target sample set of the same size for each class to perform the supervised contrastive learning procedure. The empirical results show the effectiveness of our model, which even outperforms popular large language models across several datasets. Our code is available here.
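The two steps named in the abstract (one prototype per class, then an equally sized target sample set per class) can be illustrated with a minimal sketch. It is not the authors' implementation: this section does not define the prototype, the hardness measure, or the mixup rule, so the code assumes a prototype is the mean embedding of a class, a "hard" sample is one with low cosine similarity to its own class prototype, and missing samples are synthesized by convex mixing of two samples from the same class. All function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_by_class(embeddings, labels):
    """Map each label to the list of embeddings carrying that label."""
    by_class = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        by_class[label].append(vec)
    return by_class


def class_prototypes(embeddings, labels):
    """Assumption: the prototype of a class is the mean of its embeddings."""
    by_class = group_by_class(embeddings, labels)
    return {label: mean_vector(vecs) for label, vecs in by_class.items()}


def balanced_target_sets(embeddings, labels, prototypes, size, rng):
    """Build a target set of exactly `size` vectors for every class."""
    targets = {}
    for label, vecs in group_by_class(embeddings, labels).items():
        # Assumption: hardest first = least similar to the class's own prototype.
        ranked = sorted(vecs, key=lambda v: cosine(v, prototypes[label]))
        chosen = ranked[:size]
        # Assumption: classes with too few samples are topped up by mixing
        # two randomly drawn samples of the same class with a random weight.
        while len(chosen) < size:
            a, b = rng.choice(ranked), rng.choice(ranked)
            lam = rng.random()
            chosen.append([lam * x + (1 - lam) * y for x, y in zip(a, b)])
        targets[label] = chosen
    return targets


if __name__ == "__main__":
    embeddings = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1], [0.7, 0.3], [0.0, 1.0]]
    labels = ["majority"] * 4 + ["minority"]
    prototypes = class_prototypes(embeddings, labels)
    targets = balanced_target_sets(
        embeddings, labels, prototypes, size=3, rng=random.Random(0)
    )
    for label, vecs in targets.items():
        print(label, len(vecs))
```

In this toy run, both classes end up with three target vectors even though the minority class starts with a single sample, which is the balancing property the abstract describes. A real system would operate on encoder outputs per batch and feed the resulting sets into a supervised contrastive loss.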
