Neural Networks Against (and For) Self-Training: Classification with Small Labeled and Large Unlabeled Sets

Payam Karisani

TL;DR

This work tackles semi-supervised text classification under limited labeled data by addressing semantic drift and overconfident pseudo-labels in self-training. It introduces Robust Self-Training (RST), which uses a hierarchical, iteration-aware pseudo-label strategy and a subsampling-based Score(d) that accounts for prediction uncertainty to select high-quality pseudo-labels. Empirical results on five benchmarks show that RST outperforms ten baselines and provides additive gains when combined with domain-specific language model pretraining, highlighting its practical value for data-scarce NLP tasks. The approach offers a general framework for robust semi-supervised learning with clear mechanisms to stabilize bootstrapping and calibrate confidence in predictions, with potential extensions to cross-lingual scenarios.

Abstract

We propose a semi-supervised text classifier based on self-training using one positive and one negative property of neural networks. One of the weaknesses of self-training is the semantic drift problem, where noisy pseudo-labels accumulate over iterations and consequently the error rate soars. In order to tackle this challenge, we reshape the role of pseudo-labels and create a hierarchical order of information. In addition, a crucial step in self-training is to use the classifier confidence prediction to select the best candidate pseudo-labels. This step cannot be efficiently done by neural networks, because it is known that their output is poorly calibrated. To overcome this challenge, we propose a hybrid metric to replace the plain confidence measurement. Our metric takes into account the prediction uncertainty via a subsampling technique. We evaluate our model in a set of five standard benchmarks, and show that it significantly outperforms a set of ten diverse baseline models. Furthermore, we show that the improvement achieved by our model is additive to language model pretraining, which is a widely used technique for using unlabeled documents. Our code is available at https://github.com/p-karisani/RST.
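
The abstract describes replacing plain classifier confidence with a metric that accounts for prediction uncertainty via subsampling. The sketch below illustrates that general idea only, and it is not the paper's Score(d): the exact formula is not given in this summary. Several classifiers are trained on random subsamples of the labeled set, and each unlabeled document is scored by how confident the averaged prediction is, minus how much the classifiers disagree. The function names, the centroid-based toy classifier, the `confidence - disagreement` combination, and the default `ratio=0.7` are all assumptions made for illustration; `n_models=3` follows the Figure 2 caption.

```python
import numpy as np


def fit_centroids(X, y):
    """Toy stand-in for a neural classifier: one centroid per class (labels 0/1)."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])


def prob_positive(centroids, X):
    """P(class = 1), from a softmax over negative distances to the two centroids."""
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    logits = -dist
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp[:, 1] / exp.sum(axis=1)


def subsample_scores(X_lab, y_lab, X_unl, n_models=3, ratio=0.7, seed=0):
    """Score unlabeled points using classifiers trained on labeled subsamples.

    Hypothetical scoring rule (an assumption, not the paper's Score(d)):
    score = confidence of the mean prediction - std across the classifiers.
    """
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_models):
        # Stratified subsample without replacement, so both classes are present.
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y_lab == c),
                       size=max(1, int(ratio * np.sum(y_lab == c))),
                       replace=False)
            for c in (0, 1)
        ])
        centroids = fit_centroids(X_lab[idx], y_lab[idx])
        probs.append(prob_positive(centroids, X_unl))
    probs = np.stack(probs)                # shape: (n_models, n_unlabeled)
    mean_p = probs.mean(axis=0)
    confidence = np.abs(mean_p - 0.5) * 2  # 0 = undecided, 1 = fully confident
    disagreement = probs.std(axis=0)       # spread across subsampled classifiers
    score = confidence - disagreement
    labels = (mean_p >= 0.5).astype(int)
    return score, labels


def select_pseudo_labels(score, labels, top_k):
    """Return indices and pseudo-labels of the top_k highest-scoring points."""
    order = np.argsort(-score)[:top_k]
    return order, labels[order]
```

In a self-training loop, `select_pseudo_labels` would be called once per iteration to pick the candidates that are added as pseudo-labels. A point on which the subsampled classifiers disagree is penalized even if its average confidence is high, which is the behavior the plain confidence measurement lacks.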

Paper Structure

This paper contains 14 sections, 5 equations, 3 figures, 5 tables, 1 algorithm.

Figures (3)

  • Figure 1: (a) F1 of RST and Self-pretraining at varying unlabeled set sizes. (b) The sensitivity of RST to the sample ratio.
  • Figure 2: The sensitivity of RST to the number of classifiers. We see that our model reaches the highest performance when three classifiers are used.
  • Figure 3: (a) The sensitivity of RST to the penalty term $\lambda$. (b) The convergence rate of RST when we use the regular cross entropy instead of our loss function. The modified method is denoted by RST (CE).