Exploratory Machine Learning with Unknown Unknowns
Peng Zhao, Jia-Wei Shan, Yu-Jie Zhang, Zhi-Hua Zhou
TL;DR
This work tackles learning in the presence of unknown unknowns created by insufficient feature information, where hidden classes may be misperceived as known ones during training. It introduces Exploratory Machine Learning (ExML), a framework combining a rejection model, feature exploration, and model cascade to both classify known classes and uncover hidden ones by augmenting the feature space under a budget. Theoretical results show that ExML can achieve favorable excess-risk guarantees relative to standard supervised learning, using uniform allocation or median-elimination strategies to identify informative features. Empirical evaluation on synthetic and real-world datasets demonstrates that ExML improves robustness to hidden classes, with budget-aware feature exploration effectively focusing on high-quality, cost-efficient features. The framework offers a principled approach to open-world learning under feature deficiency and suggests broader applicability to decision-making contexts with unknown unknowns.
Abstract
In conventional supervised learning, a training dataset is given with ground-truth labels from a known label set, and the learned model classifies unseen instances into the known labels. This paper studies a new problem setting in which the training data contain unknown classes that have been misperceived as other labels, so their existence is not apparent from the given supervision. We attribute these unknown unknowns to insufficient feature information, which leaves the label space incompletely perceived and the training data misleadingly labeled. To address this, we propose exploratory machine learning, which examines and investigates the training data by actively augmenting the feature space to discover potentially hidden classes. Our method consists of three ingredients: a rejection model, feature exploration, and a model cascade. We provide theoretical analysis to justify its superiority, and validate its effectiveness on both synthetic and real datasets.
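To make the budgeted feature exploration step more concrete, the following is a minimal sketch of a uniform allocation strategy: the budget is split evenly across the candidate features, each candidate is scored by the average of its noisy quality observations, and the highest-scoring candidate is kept. This is an illustration under stated assumptions, not the authors' implementation. The names `uniform_allocation`, `noisy_query`, and `TRUE_QUALITY` are hypothetical, and modeling each query as a Bernoulli observation of a feature's quality is an assumption made only to keep the example self-contained.

```python
import random


def uniform_allocation(candidates, budget, query, rng):
    """Split the budget evenly across candidates and return the best one.

    candidates: list of candidate feature names (hypothetical).
    budget:     total number of queries allowed.
    query:      callable (name, rng) -> noisy quality score in [0, 1].
    Returns (best_name, dict of empirical mean scores).
    """
    per_feature = budget // len(candidates)
    if per_feature == 0:
        raise ValueError("budget too small to query every candidate once")

    means = {}
    for name in candidates:
        # Spend the same number of queries on every candidate feature.
        scores = [query(name, rng) for _ in range(per_feature)]
        means[name] = sum(scores) / per_feature

    # Keep the candidate with the highest empirical mean score.
    best = max(means, key=means.get)
    return best, means


# Assumed ground-truth qualities, unknown to the learner (illustrative only).
TRUE_QUALITY = {"f1": 0.2, "f2": 0.8, "f3": 0.5}


def noisy_query(name, rng):
    # Assumption: one query yields a Bernoulli draw with the feature's quality.
    return 1.0 if rng.random() < TRUE_QUALITY[name] else 0.0


rng = random.Random(0)
best, means = uniform_allocation(list(TRUE_QUALITY), budget=3000,
                                 query=noisy_query, rng=rng)
print(best, means)
```

Uniform allocation is the simplest choice because it needs no adaptivity. The median-elimination strategy mentioned in the summary would instead proceed in rounds and discard the lower-scoring half of the candidates each time, concentrating the remaining budget on the more promising features.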
