Learning with Selectively Labeled Data from Multiple Decision-makers
Jian Chen, Zhehao Li, Xiaojie Mao
TL;DR
This paper tackles classification with selectively labeled data that arise from historical decisions made by multiple decision-makers. By casting the problem in an instrumental variable (IV) framework with multi-valued IVs, it derives conditions under which the classification risk is exactly identified and, when those conditions fail, tight partial identification bounds on that risk. It then introduces Unified Cost-sensitive Learning (UCL), which learns classifiers robust to selection bias in both the point-identified and partially identified regimes, using calibrated surrogate losses and cross-fitting for estimation. The paper provides theoretical guarantees for identification and generalization, and its simulations indicate that PartialLearning, a robust partial-identification approach, achieves near-ideal performance even under strong selection bias. The result is a principled, nonparametric approach to learning when labels are missing not at random (MNAR) and decision-makers are heterogeneous, which is relevant to fair and reliable deployment in high-stakes decision systems.
Abstract
We study the problem of classification with selectively labeled data, whose distribution may differ from the full population due to historical decision-making. We exploit the fact that in many applications historical decisions were made by multiple decision-makers, each with different decision rules. We analyze this setup under a principled instrumental variable (IV) framework and rigorously study the identification of classification risk. We establish conditions for the exact identification of classification risk and derive tight partial identification bounds when exact identification fails. We further propose a unified cost-sensitive learning (UCL) approach to learn classifiers robust to selection bias in both identification settings. Finally, we theoretically and numerically validate the efficacy of our proposed method.
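To make the setting concrete, the following is a minimal simulation sketch, not the paper's UCL method. It assumes a hypothetical data-generating process in which cases are randomly assigned to four decision-makers with different labeling rates, and in which labeling depends on the true outcome (so labels are missing not at random). It then compares the risk of a fixed classifier on labeled data only with classical worst-case (Manski-style) bounds, intersected across decision-makers. All variable names, rates, and the classifier are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_dm = 200_000, 4  # sample size and number of decision-makers (assumed)

# Hypothetical population: feature X, binary outcome Y.
x = rng.normal(size=n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-2.0 * x)))

# Random assignment to decision-makers, so Z is independent of (X, Y).
z = rng.integers(0, n_dm, size=n)

# Each decision-maker has its own base labeling rate, and labeling is more
# likely when Y = 1: selection depends on the outcome itself.
base = np.array([0.35, 0.55, 0.70, 0.85])
p_label = base[z] + 0.10 * (2 * y - 1)
d = rng.binomial(1, p_label)  # d = 1 means the label is observed

# A fixed (deliberately imperfect) classifier and its 0-1 loss.
pred = (x > 0.5).astype(int)
loss = (pred != y).astype(float)

true_risk = loss.mean()           # not available in practice
naive_risk = loss[d == 1].mean()  # labeled-only estimate, biased here

# Worst-case bounds per decision-maker: an unlabeled case contributes a loss
# of 0 to the lower bound and 1 to the upper bound. Only labeled losses are
# used, because loss * d is zero whenever d = 0.
lower_z = np.array([(loss * d)[z == k].mean() for k in range(n_dm)])
upper_z = lower_z + np.array([(1 - d)[z == k].mean() for k in range(n_dm)])

# Because Z is independent of (X, Y), every decision-maker bounds the same
# population risk, so the bounds can be intersected.
lower, upper = lower_z.max(), upper_z.min()

print(f"true risk   : {true_risk:.3f}")
print(f"naive risk  : {naive_risk:.3f}")
print(f"risk bounds : [{lower:.3f}, {upper:.3f}]")
```

In this sketch the labeled-only estimate overstates the risk, while the intersected bounds bracket it. The bounds are driven by the most lenient decision-maker, which illustrates why heterogeneity across decision-makers carries identifying information.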
