Simple-Sampling and Hard-Mixup with Prototypes to Rebalance Contrastive Learning for Text Classification

Mengyu Li, Yonghao Liu, Fausto Giunchiglia, Ximing Li, Xiaoyue Feng, Renchu Guan

TL;DR

SharpReCL addresses imbalanced text classification by tightly coupling supervised contrastive learning with a balanced classification branch through class prototypes. It introduces simple-sampling and hard-mixup to generate a balanced, hard-sample-rich contrastive dataset, ensuring minority classes appear in every batch and providing stronger gradients. The approach yields strong empirical results across multiple imbalanced datasets, often outperforming baselines and, on some benchmarks, large language models, while offering clear ablations and mathematical insights into gradient behavior. This work provides a practical, prototype-guided framework to improve robustness and generalization in imbalanced text classification tasks.

Abstract

Text classification is a crucial and fundamental task in web content mining. Compared with the previous learning paradigm of pre-training and fine-tuning by cross entropy loss, the recently proposed supervised contrastive learning approach has received tremendous attention due to its powerful feature learning capability and robustness. Although several studies have incorporated this technique for text classification, some limitations remain. First, many text datasets are imbalanced, and the learning mechanism of supervised contrastive learning is sensitive to data imbalance, which may harm the model's performance. Moreover, these models leverage separate classification branches with cross entropy and supervised contrastive learning branches without explicit mutual guidance. To this end, we propose a novel model named SharpReCL for imbalanced text classification tasks. First, we obtain the prototype vector of each class in the balanced classification branch to act as a representation of each class. Then, by further explicitly leveraging the prototype vectors, we construct a proper and sufficient target sample set with the same size for each class to perform the supervised contrastive learning procedure. The empirical results show the effectiveness of our model, which even outperforms popular large language models across several datasets. Our code is available here.
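The two steps named in the abstract (one prototype per class, then an equal-size target set per class) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: prototypes are taken as per-class mean embeddings, "hard" samples are the ones least similar (by dot product) to their own class prototype, and the names `class_prototypes`, `balanced_target_set`, `per_class`, and `lam` are hypothetical.

```python
import random
from collections import defaultdict


def group_by_class(embeddings, labels):
    """Collect embedding vectors under their class label."""
    groups = defaultdict(list)
    for vec, y in zip(embeddings, labels):
        groups[y].append(vec)
    return groups


def class_prototypes(embeddings, labels):
    """Assumed prototype definition: the mean embedding of each class."""
    groups = group_by_class(embeddings, labels)
    return {
        y: [sum(col) / len(vecs) for col in zip(*vecs)]
        for y, vecs in groups.items()
    }


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def balanced_target_set(embeddings, labels, per_class, lam=0.5, seed=0):
    """Return exactly `per_class` target vectors for every class.

    Half are drawn by plain sampling with replacement; the rest are
    mixups of the two samples least similar to their own prototype
    (an illustrative stand-in for hard samples).
    """
    rng = random.Random(seed)
    groups = group_by_class(embeddings, labels)
    protos = class_prototypes(embeddings, labels)
    targets = {}
    for y, vecs in groups.items():
        n_simple = per_class // 2
        simple = [rng.choice(vecs) for _ in range(n_simple)]
        # Lowest similarity to the class prototype first, so these are the hardest.
        hard = sorted(vecs, key=lambda v: dot(v, protos[y]))[:2]
        mixed = []
        for _ in range(per_class - n_simple):
            a, b = rng.choice(hard), rng.choice(hard)
            mixed.append([lam * p + (1 - lam) * q for p, q in zip(a, b)])
        targets[y] = simple + mixed
    return targets
```

With a majority class of four samples and a minority class of one, calling `balanced_target_set(embeddings, labels, per_class=4)` returns four vectors for each class, so the minority class is represented as often as the majority class in the contrastive set.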

Paper Structure (17 sections, 13 equations, 6 figures, 6 tables, 1 algorithm)

Figures (6)

  • Figure 1: Label distribution of the Ohsumed training set. The horizontal axis shows the category, and the vertical axis shows the corresponding frequency.
  • Figure 2: Architecture of SharpReCL (Best viewed in color).
  • Figure 3: Sensitivity of SharpReCL with respect to $\mu$ and $\tau$ on the R52 and Ohsumed datasets.
  • Figure 4: Visualization of different models on TREC under $ir$=50. Left: SCLCE, middle: SPCL, right: our method.
  • Figure A.1: Hyperparameter sensitivity of our model on TREC and DBLP under $ir$=50.
  • ...and 1 more figure