Robust Semi-supervised Learning by Wisely Leveraging Open-set Data
Yang Yang, Nan Jiang, Yi Xu, De-Chuan Zhan
TL;DR
The paper addresses the realistic open-set semi-supervised learning (OSSL) setting, in which unlabeled data include classes unseen in the labeled set, and shows that indiscriminately using all open-set data can harm in-distribution (ID) generalization. It proposes WiseOpen, a gradient-variance-based data-selection framework that leverages friendly open-set samples and discards unfriendly ones; two practical variants, WiseOpen-E and WiseOpen-L, balance accuracy gains against computation. Theoretical analysis links gradient variance to generalization, and extensive experiments on CIFAR-10/100 and Tiny-ImageNet show consistent ID accuracy improvements and competitive out-of-distribution (OOD) detection. The approach is designed as a plug-in module that can enhance existing OSSL methods such as OpenMatch and IOMatch, offering a practical path toward more robust open-set learning on real-world data distributions.
Abstract
Open-set Semi-supervised Learning (OSSL) considers the realistic setting in which unlabeled data may come from classes unseen in the labeled set, i.e., out-of-distribution (OOD) data, which can degrade the performance of conventional SSL models. To handle this issue, some existing OSSL approaches employ an extra OOD detection module, in addition to the traditional in-distribution (ID) classifier, to avoid the potential negative impact of the OOD data. Nevertheless, these approaches typically use the entire set of open-set data during training, and this set may contain data that are unfriendly to the OSSL task and degrade model performance. This motivates us to develop a robust open-set data selection strategy for OSSL. Guided by a theoretical understanding from the perspective of learning theory, we propose Wise Open-set Semi-supervised Learning (WiseOpen), a generic OSSL framework that selectively leverages the open-set data for training the model. By applying a gradient-variance-based selection mechanism, WiseOpen exploits a friendly subset instead of the whole open-set dataset to enhance the model's capability of ID classification. Moreover, to reduce the computational expense, we also propose two practical variants of WiseOpen that adopt low-frequency updates and loss-based selection, respectively. Extensive experiments demonstrate the effectiveness of WiseOpen in comparison with the state of the art.
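
To make the idea of gradient-based data selection concrete, the following is a minimal illustrative sketch, not the paper's implementation. The abstract does not state the exact selection criterion, so this sketch assumes one plausible form: each open-set sample is scored by the squared distance between its per-sample gradient and the mean gradient over labeled ID data, and samples whose score is at most a fixed threshold are kept as "friendly". The function names (`mean_vector`, `squared_deviation`, `select_friendly`), the toy gradient values, and the threshold are all hypothetical.

```python
from typing import List, Sequence, Tuple


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Coordinate-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def squared_deviation(vec: Sequence[float], center: Sequence[float]) -> float:
    """Squared Euclidean distance between a vector and a reference point."""
    return sum((v - c) ** 2 for v, c in zip(vec, center))


def select_friendly(
    open_set_grads: Sequence[Sequence[float]],
    id_grads: Sequence[Sequence[float]],
    threshold: float,
) -> Tuple[List[int], List[float]]:
    """Return indices of open-set samples whose gradient stays close to the
    mean ID gradient, together with every sample's score.

    Assumption: "close" means squared deviation <= threshold. The criterion
    actually used by WiseOpen may differ.
    """
    center = mean_vector(id_grads)
    scores = [squared_deviation(g, center) for g in open_set_grads]
    keep = [i for i, s in enumerate(scores) if s <= threshold]
    return keep, scores


# Toy per-sample gradients (hypothetical values, two parameters each).
id_grads = [[1.0, 0.0], [0.8, 0.2], [1.2, -0.2]]
open_set_grads = [[0.9, 0.1], [-1.0, 2.0], [1.1, 0.0]]

keep, scores = select_friendly(open_set_grads, id_grads, threshold=0.5)
print(keep)  # indices of the samples retained for training
```

In this toy run the second open-set sample has a gradient that points away from the ID gradients, so it receives a much larger score than the other two and is excluded. In a real training loop the per-sample gradients would come from the model's loss, which is the expensive step that, per the abstract, the low-frequency-update and loss-based variants are meant to reduce.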
