Safe Semi-Supervised Contrastive Learning Using In-Distribution Data as Positive Examples
Min Gu Kwak, Hyungu Kahng, Seoung Bum Kim
TL;DR
This work tackles the practical problem of class distribution mismatch in semi-supervised learning by proposing Safe Semi-Supervised Contrastive Learning (SSCL), which leverages in-distribution data as additional positives within a MoCo-based self-supervised contrastive framework. A novel loss, L_i^{ID}, reuses labeled negatives of the same class as positives, and a coefficient schedule w(t) gradually reduces its influence to prevent overfitting, while a memory queue preserves class information for ID-aware sampling. Empirical results on CIFAR-10, CIFAR-100, Tiny ImageNet, and CIFAR-100+Tiny ImageNet under varied mismatch ratios show that SSCL improves representation quality and downstream classification, often outperforming strong baselines and safe SSL methods, with larger gains in challenging scenarios. The approach demonstrates that incorporating ID information through selective positive augmentation and a principled schedule yields robust, scalable improvements without discarding unlabeled OOD data, and suggests avenues for adaptive scheduling and stronger augmentations in future work.
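The interaction between the self-supervised term, the ID-positive term, and the coefficient schedule can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The summary above does not give the exact form of L_i^{ID} or of w(t), so several things here are assumptions: the averaged log-probability form of the ID term, the linear decay used for w(t), the temperature of 0.2, the convention that a label of -1 marks an unlabeled example, and all function names.

```python
import numpy as np


def log_sum_exp(x, axis=1):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))).squeeze(axis)


def info_nce_loss(q, k_pos, queue, tau=0.2):
    """MoCo-style loss per anchor: the augmented key is the only positive,
    every entry of the memory queue is a negative."""
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True) / tau   # (B, 1)
    l_neg = q @ queue.T / tau                                 # (B, K)
    logits = np.concatenate([l_pos, l_neg], axis=1)           # (B, 1 + K)
    return log_sum_exp(logits) - l_pos[:, 0]


def id_positive_loss(q, k_pos, queue, anchor_labels, queue_labels, tau=0.2):
    """Assumed form of the ID term: queue entries that share the anchor's
    class are treated as positives instead of negatives.
    Labels of -1 mark unlabeled examples (an assumption of this sketch)."""
    l_pos = np.sum(q * k_pos, axis=1, keepdims=True) / tau
    l_queue = q @ queue.T / tau
    log_denom = log_sum_exp(np.concatenate([l_pos, l_queue], axis=1))
    log_prob = l_queue - log_denom[:, None]                   # (B, K)
    same_class = anchor_labels[:, None] == queue_labels[None, :]
    mask = same_class & (anchor_labels[:, None] >= 0)
    n_pos = mask.sum(axis=1)
    summed = np.where(mask, log_prob, 0.0).sum(axis=1)
    # Anchors that are unlabeled or have no same-class entry contribute zero.
    return np.where(n_pos > 0, -summed / np.maximum(n_pos, 1), 0.0)


def coefficient_schedule(step, total_steps, w0=1.0):
    """Hypothetical linear decay for w(t): starts at w0 and reaches zero."""
    return w0 * max(0.0, 1.0 - step / total_steps)


def sscl_loss(q, k_pos, queue, anchor_labels, queue_labels, step, total_steps):
    """Batch loss: self-supervised term plus the down-weighted ID term."""
    w = coefficient_schedule(step, total_steps)
    l_ssl = info_nce_loss(q, k_pos, queue).mean()
    l_id = id_positive_loss(q, k_pos, queue, anchor_labels, queue_labels).mean()
    return l_ssl + w * l_id
```

All embeddings are expected to be L2-normalized, as is usual in MoCo-style training. Storing `queue_labels` alongside the queued keys is what allows same-class entries to be found at loss time, which is the role the memory queue plays in the description above.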
Abstract
Semi-supervised learning methods have shown promising results in solving many practical problems when only a few labels are available. Existing methods assume that the class distributions of labeled and unlabeled data are equal; however, their performance degrades significantly in class distribution mismatch scenarios, where the unlabeled data contain out-of-distribution (OOD) data. Previous safe semi-supervised learning studies have addressed this problem by making OOD data less likely to affect training based on labeled data. However, even if these methods effectively filter out the unnecessary OOD data, they can lose the basic information that all data share regardless of class. To this end, we propose to apply a self-supervised contrastive learning approach to fully exploit a large amount of unlabeled data. We also propose a contrastive loss function with a coefficient schedule that aggregates labeled negative examples of the same class as the anchor into the positive examples. To evaluate the performance of the proposed method, we conduct experiments on image classification datasets (CIFAR-10, CIFAR-100, Tiny ImageNet, and CIFAR-100+Tiny ImageNet) under various mismatch ratios. The results show that self-supervised contrastive learning significantly improves classification accuracy. Moreover, aggregating the in-distribution examples produces better representations and consequently further improves classification accuracy.
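To make the notion of a mismatch ratio concrete, the sketch below builds an unlabeled pool with a controlled share of OOD examples. It assumes the common convention that the mismatch ratio is the fraction of unlabeled examples drawn from classes absent from the labeled set; the abstract does not spell out its definition. The function name and parameters are hypothetical.

```python
import numpy as np


def build_unlabeled_indices(labels, id_classes, n_unlabeled, mismatch_ratio, seed=0):
    """Sample indices for an unlabeled pool in which a fraction `mismatch_ratio`
    of the examples comes from OOD classes (assumed definition)."""
    if not 0.0 <= mismatch_ratio <= 1.0:
        raise ValueError("mismatch_ratio must lie in [0, 1]")
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    # Split the dataset into in-distribution and out-of-distribution pools.
    is_id = np.isin(labels, id_classes)
    id_pool = np.flatnonzero(is_id)
    ood_pool = np.flatnonzero(~is_id)

    n_ood = int(round(mismatch_ratio * n_unlabeled))
    n_id = n_unlabeled - n_ood
    if n_id > id_pool.size or n_ood > ood_pool.size:
        raise ValueError("not enough examples to satisfy the requested ratio")

    chosen = np.concatenate([
        rng.choice(id_pool, size=n_id, replace=False),
        rng.choice(ood_pool, size=n_ood, replace=False),
    ])
    rng.shuffle(chosen)  # avoid an ID-then-OOD ordering in the pool
    return chosen
```

For example, with a ten-class dataset in which classes 0 to 5 are treated as in-distribution, calling `build_unlabeled_indices(labels, id_classes=range(6), n_unlabeled=200, mismatch_ratio=0.3)` would return 200 indices, 60 of which point to examples from classes 6 to 9.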
