IDoFew: Intermediate Training Using Dual-Clustering in Language Models for Few Labels Text Classification

Abdullah Alsuhaibani, Hamad Zogan, Imran Razzak, Shoaib Jameel, Guandong Xu

TL;DR

IDoFew tackles the cold-start problem in few-label text classification by introducing a dual-clustering intermediate training framework. The method combines a first-stage SIB clustering on TF-IDF features with a second-stage SBERT-based KMeans refinement on a small subset, followed by fine-tuning a pre-trained language model using a limited labeled set ($|D_l| \ll |D|$). Across seven datasets and multiple base pre-trained models (PTMs), IDoFew consistently outperforms baselines and state-of-the-art intermediate-training approaches, achieving substantial gains on multi-class tasks and demonstrating robustness to label scarcity. The work shows practical potential for improving label-efficient NLP pipelines by transferring knowledge through a dual-clustering intermediate stage.

Abstract

Language models such as Bidirectional Encoder Representations from Transformers (BERT) have been very effective in various Natural Language Processing (NLP) and text mining tasks including text classification. However, some tasks still pose challenges for these models, including text classification with limited labels. This can result in a cold-start problem. Although some approaches have attempted to address this problem through single-stage clustering as an intermediate training step coupled with a pre-trained language model, which generates pseudo-labels to improve classification, these methods are often error-prone due to the limitations of the clustering algorithms. To overcome this, we have developed a novel two-stage intermediate clustering with subsequent fine-tuning that models the pseudo-labels reliably, resulting in reduced prediction errors. The key novelty in our model, IDoFew, is that the two-stage clustering coupled with two different clustering algorithms helps exploit the advantages of the complementary algorithms that reduce the errors in generating reliable pseudo-labels for fine-tuning. Our approach has shown significant improvements compared to strong comparative models.
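
The two pseudo-labelling stages described above can be sketched roughly as follows. This is a minimal, dependency-free illustration under stated assumptions, not the authors' implementation: plain k-means stands in for SIB in the first stage, a toy letter-frequency vector (`toy_embed`) stands in for SBERT embeddings in the second stage, and the helper names `tfidf_vectors`, `kmeans`, and `dual_cluster_pseudo_labels` are hypothetical. The intermediate training on the pseudo-labels and the final fine-tuning on $D_l$ are omitted.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """L2-normalised TF-IDF vectors for whitespace-tokenised documents."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    n_docs = len(docs)
    vectors = []
    for toks in tokenised:
        vec = [0.0] * len(vocab)
        for tok, count in Counter(toks).items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0
            vec[index[tok]] = count * idf
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster id per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid for every vector.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: move each non-empty centroid to its members' mean.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def toy_embed(doc):
    """Toy dense embedding (letter frequencies); an assumed stand-in for SBERT."""
    counts = Counter(ch for ch in doc.lower() if ch.isascii() and ch.isalpha())
    total = sum(counts.values()) or 1
    return [counts.get(chr(ord("a") + i), 0) / total for i in range(26)]


def dual_cluster_pseudo_labels(docs, k, subset_size, embed=toy_embed, seed=0):
    """Return (stage-1 pseudo-labels for all docs, stage-2 pseudo-labels for a subset)."""
    # Stage 1: cluster every unlabeled document on TF-IDF features
    # (k-means here is an assumption; the paper uses SIB at this step).
    stage1 = kmeans(tfidf_vectors(docs), k, seed=seed)
    # Stage 2: cluster a small random subset on dense sentence embeddings.
    rng = random.Random(seed)
    subset_idx = sorted(rng.sample(range(len(docs)), subset_size))
    stage2 = kmeans([embed(docs[i]) for i in subset_idx], k, seed=seed)
    return stage1, dict(zip(subset_idx, stage2))


docs = [
    "the match ended with a late goal",
    "the striker scored a goal in the match",
    "the team won the football match",
    "the bank raised interest rates",
    "interest rates and inflation worry the bank",
    "the market reacted to the rates decision",
]
stage1, stage2 = dual_cluster_pseudo_labels(docs, k=2, subset_size=4)
print("stage 1 pseudo-labels:", stage1)
print("stage 2 pseudo-labels:", stage2)
```

In a real pipeline, `embed` would be replaced by a sentence encoder, and each set of pseudo-labels would serve as the target of an intermediate classification task before the model is fine-tuned on the few true labels.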

Paper Structure

This paper contains 11 sections, 7 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Intermediate Dual-Clustering for Few Labels in text classification (IDoFew). PTM denotes the pre-trained models (BERT, RoBERTa, and DistilBERT). The dash-dotted paths produce the pseudo-labels for each model.
  • Figure 2: Number of clusters for Be-SIB-KMeans$_{FT}$.
  • Figure 3: Accuracy of our model Be-SIB-KMeans$_{FT}$, Be-SIB-KMeans, and Be-SIB-SIB with limited samples at the fine-tuning stage.
  • Figure 4: Ablation study of the different components of IDoFew.