Investigating Semi-Supervised Learning Algorithms in Text Datasets
Himmet Toprak Kesgin, Mehmet Fatih Amasyali
TL;DR
This paper tackles the challenge of applying semi-supervised learning (SSL) to text, where data augmentation is less effective than it is for images. It systematically compares four proxy-label methods that need no augmentation (self-training, co-training, tri-training, and tri-training with disagreement) on four Turkish text classification tasks using BERTurk embeddings. Tri-training with disagreement emerges as the most promising approach and comes closest to the Oracle, yet a noticeable gap to that upper bound remains, highlighting the need for new methods. The study also characterizes trade-offs, such as the benefit of training from scratch in self-training and the dataset-dependent performance of ensembles, offering practical guidance for SSL deployment in low-label scenarios. The results motivate future research toward robust SSL algorithms for text.
Abstract
Using large training datasets enhances the generalization capabilities of neural networks. Semi-supervised learning (SSL) is useful when labeled data is scarce and unlabeled data is plentiful. SSL methods that rely on data augmentation are the most successful on image datasets. Text, in contrast, lacks augmentation methods as consistent as those available for images, so augmentation-based methods are less effective on text than on images. In this study, we compare SSL algorithms that do not require augmentation: self-training, co-training, tri-training, and tri-training with disagreement. Our experiments use four text datasets covering different tasks. We examine the algorithms from several perspectives by posing experimental questions, and we suggest several improvements. Among the algorithms, tri-training with disagreement performed closest to the Oracle; however, the remaining performance gap shows that new semi-supervised algorithms, or improvements to existing methods, are needed.
