Exploratory Machine Learning with Unknown Unknowns

Peng Zhao, Jia-Wei Shan, Yu-Jie Zhang, Zhi-Hua Zhou

TL;DR

This work tackles learning in the presence of unknown unknowns created by insufficient feature information, where hidden classes may be misperceived as known ones during training. It introduces Exploratory Machine Learning (ExML), a framework combining a rejection model, feature exploration, and model cascade to both classify known classes and uncover hidden ones by augmenting the feature space under a budget. Theoretical results show that ExML can achieve favorable excess-risk guarantees relative to standard supervised learning, using uniform allocation or median-elimination strategies to identify informative features. Empirical evaluation on synthetic and real-world datasets demonstrates that ExML improves robustness to hidden classes, with budget-aware feature exploration effectively focusing on high-quality, cost-efficient features. The framework offers a principled approach to open-world learning under feature deficiency and suggests broader applicability to decision-making contexts with unknown unknowns.

Abstract

In conventional supervised learning, a training dataset is given with ground-truth labels from a known label set, and the learned model will classify unseen instances to known labels. This paper studies a new problem setting in which there are unknown classes in the training data misperceived as other labels, and thus their existence appears unknown from the given supervision. We attribute the unknown unknowns to the fact that the training dataset is badly advised by the incompletely perceived label space due to the insufficient feature information. To this end, we propose the exploratory machine learning, which examines and investigates training data by actively augmenting the feature space to discover potentially hidden classes. Our method consists of three ingredients including rejection model, feature exploration, and model cascade. We provide theoretical analysis to justify its superiority, and validate the effectiveness on both synthetic and real datasets.

Paper Structure

This paper contains 43 sections, 6 theorems, 48 equations, 9 figures, 2 tables, 1 algorithm.

Key Result

Lemma 1

Let $a_{i_s}$ be the feature identified by uniform allocation. Then uniform allocation identifies the best feature (i.e., $i_s=1$) with probability at least $1-\delta_{\text{fail}}$, provided that the identification condition $\lfloor B/K \rfloor > \frac{16((1-\theta)\kappa\Lambda)^2}{((1-2\theta)\Delta)^2}$ holds, where $\theta$ is the threshold of the rejection model defined in the surrogate loss (eq:surrogate-loss).
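To make the identification condition concrete, the following is a minimal Python sketch that evaluates the inequality for given values. It assumes that $B$ is the total exploration budget and $K$ the number of candidate features, so that $\lfloor B/K \rfloor$ is the per-feature budget under uniform allocation. The quantities $\kappa$, $\Lambda$, and $\Delta$ are taken at face value from the lemma, since they are not defined in this summary. The function name and the restriction $0 \le \theta < 1/2$ (needed to keep the denominator nonzero) are assumptions made for illustration, not part of the paper.

```python
def identification_condition_holds(B, K, theta, kappa, Lambda, Delta):
    """Check floor(B/K) > 16((1-theta)*kappa*Lambda)^2 / ((1-2*theta)*Delta)^2.

    Hypothetical helper: parameter names mirror the symbols in Lemma 1.
    """
    if K <= 0:
        raise ValueError("K must be a positive number of candidate features")
    if not 0 <= theta < 0.5:
        # Assumed range; theta = 0.5 would make the denominator zero.
        raise ValueError("theta is assumed to lie in [0, 0.5)")
    if Delta == 0:
        raise ValueError("Delta must be nonzero")

    per_feature_budget = B // K  # floor(B / K) for non-negative integers
    threshold = 16 * ((1 - theta) * kappa * Lambda) ** 2 / ((1 - 2 * theta) * Delta) ** 2
    return per_feature_budget > threshold


# Illustrative values only: theta = 0.25, kappa = Lambda = 1, Delta = 0.5
# give a threshold of 16 * 0.75^2 / 0.25^2 = 144 samples per feature.
enough = identification_condition_holds(1500, 10, 0.25, 1.0, 1.0, 0.5)
too_few = identification_condition_holds(1400, 10, 0.25, 1.0, 1.0, 0.5)
```

With these illustrative numbers, a budget of 1500 over 10 candidates gives 150 samples per feature and satisfies the condition, while 1400 gives 140 and does not. The threshold grows as $\theta$ approaches $1/2$ or as $\Delta$ shrinks.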

Figures (9)

  • Figure 1: Unknown unknowns in the task of medical diagnosis. Patients with lung cancer are misdiagnosed as asthma or pneumonia due to the lack of CT scan devices, and thus appear as unknown to the learned model.
  • Figure 2: An example illustrating that an informative feature can substantially improve the separability of low-confidence samples and make the hidden class distinguishable.
  • Figure 3: Comparison of two learning frameworks. Conventional supervised learning exploits the observable dataset for prediction. Exploratory machine learning explores more features based on the operational dataset for both prediction and discovery of the hidden classes.
  • Figure 4: The left figure shows the overall procedure of ExML. Our approach begins with an initial model (blue part), followed by exploring the best candidate feature among the candidates (green part). Afterwards, a learned model is retrained based on the augmented dataset, and finally is cascaded with the initial model to discover the hidden class (red part). The right figure describes the procedure of the feature exploration in ExML.
  • Figure 5: Visualization of synthetic data. (a): ground-truth distribution; (b): training data (only first two dims are observable); (c): $t$-SNE of candidate features with various qualities (larger angles imply better features).
  • ...and 4 more figures

Theorems & Definitions (19)

  • Remark 1: Possible relaxations of some assumptions
  • Remark 2: Training-time and test-time feature cost
  • Remark 3: Reliability of the initial model
  • Remark 4: Most informative feature assumption over 0/1 loss
  • Lemma 1: Exploratory regret of uniform allocation
  • Remark 5: Launch budget in feature exploration
  • Theorem 1: Excess risk of ExML with uniform allocation
  • Remark 6: Comparison between excess risk of SL and ExML
  • Lemma 2: Exploratory regret of median elimination
  • Theorem 2: Excess risk of ExML with median elimination
  • ...and 9 more