Learning with Selectively Labeled Data from Multiple Decision-makers

Jian Chen, Zhehao Li, Xiaojie Mao

TL;DR

This paper tackles classification with selectively labeled data arising from historical decisions by multiple decision-makers. By casting the problem in an instrumental-variable (IV) framework with multi-valued IVs, it derives conditions for exact identification of the latent classification risk and, when such identification fails, tight partial identification bounds. It then introduces Unified Cost-sensitive Learning (UCL) to learn classifiers robust to selection bias under both the point-identified and partially identified regimes, using calibrated surrogates and cross-fitting for practical estimation. Theoretical guarantees are provided for identification and generalization, and extensive simulations demonstrate that PartialLearning, a robust partial-identification approach, achieves near-ideal performance even under strong selection bias. The work offers a principled, nonparametric route to learning when labels are missing not at random (MNAR) and decision-makers are heterogeneous, with practical relevance for fair and reliable deployment in high-stakes decision systems.

Abstract

We study the problem of classification with selectively labeled data, whose distribution may differ from the full population due to historical decision-making. We exploit the fact that in many applications historical decisions were made by multiple decision-makers, each with different decision rules. We analyze this setup under a principled instrumental variable (IV) framework and rigorously study the identification of classification risk. We establish conditions for the exact identification of classification risk and derive tight partial identification bounds when exact identification fails. We further propose a unified cost-sensitive learning (UCL) approach to learn classifiers robust to selection bias in both identification settings. Finally, we theoretically and numerically validate the efficacy of our proposed method.
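To make the selective-label setup concrete, the following is a minimal simulation sketch. It is not the paper's data-generating process: the Gaussian noise, the linear thresholds, the `leniency` values, and the function names `simulate` and `summarize` are all assumptions chosen for illustration. It only shows that, when an unobserved factor drives both the decision and the true label, the labeled subsample can differ from the full population, and that decision-makers with different rules label different shares of cases.

```python
import random


def simulate(n=20000, n_makers=5, seed=0):
    """Draw (x, z, d, y, y_star) tuples; y is None whenever d == 0."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)          # observed feature X
        u = rng.gauss(0.0, 1.0)          # unobserved factor U (hypothetical form)
        z = rng.randrange(n_makers)      # decision-maker Z, assigned at random
        y_star = int(x + u + rng.gauss(0.0, 1.0) > 0)   # true label Y*
        leniency = -1.0 + 0.5 * z        # hypothetical: higher z = more lenient
        d = int(x + u + leniency + rng.gauss(0.0, 1.0) > 0)  # decision D
        y = y_star if d == 1 else None   # label is revealed only when D = 1
        rows.append((x, z, d, y, y_star))
    return rows


def summarize(rows, n_makers=5):
    """Compare the full-population positive rate with the labeled-sample rate."""
    population_rate = sum(r[4] for r in rows) / len(rows)
    labeled = [r[3] for r in rows if r[2] == 1]
    labeled_rate = sum(labeled) / len(labeled)
    # Share of cases each decision-maker chooses to label.
    labeling_rates = []
    for z in range(n_makers):
        decisions = [r[2] for r in rows if r[1] == z]
        labeling_rates.append(sum(decisions) / len(decisions))
    return population_rate, labeled_rate, labeling_rates


if __name__ == "__main__":
    pop, lab, by_maker = summarize(simulate())
    print(f"positive rate, full population: {pop:.3f}")
    print(f"positive rate, labeled sample:  {lab:.3f}")
    print("labeling rate per decision-maker:", [round(p, 3) for p in by_maker])
```

In this sketch the labeled sample over-represents positives because the decision and the true label share both the observed feature and the unobserved factor, so a classifier trained naively on labeled cases would see a shifted distribution. The spread in labeling rates across decision-makers is the kind of variation an IV-based analysis can exploit.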

Paper Structure

This paper contains 37 sections, 14 theorems, 93 equations, 7 figures, 1 table, and 1 algorithm.

Key Result

Lemma 4.1

The excess risk $\mathcal{R}(t, s^{\star})$ satisfies

Figures (7)

  • Figure 1: Causal graph of the selective label problem. The dashed nodes $U$ and $Y^\star$ are unobserved. The dashed line from $X$ to $Z$ means that $Z$ is allowed, but not required, to be affected by $X$.
  • Figure 2: Testing accuracy of the methods with $\alpha \in \{0.5, 0.7, 0.9\}$ under Model NUCEM on the FICO dataset.
  • Figure 3: Testing accuracy of the methods with $\alpha \in \{0.5, 0.7, 0.9\}$ under Model UC on the FICO dataset.
  • Figure 4: Testing accuracy of the methods with $\alpha_Y \in \{0.3, 0.5, 0.7\}$ and $\alpha_D \in \{0.3, 0.5, 0.7\}$ under Model NUCEM on the synthetic dataset.
  • Figure 5: Testing accuracy of the methods with $\alpha_Y \in \{0.3, 0.5, 0.7\}$ and $\alpha_D \in \{0.3, 0.5, 0.7\}$ under Model UC on the synthetic dataset.
  • ...and 2 more figures

Theorems & Definitions (30)

  • Lemma 4.1
  • Theorem 4.3: Point Identification
  • Theorem 4.4
  • Example 1: Full Information
  • Example 2: Separable and Independent Unobservables
  • Example 3: Additive Decision Probability
  • Theorem 5.2
  • Theorem 5.3
  • Theorem 6.1: Calibration Bound for Cost-sensitive Risk
  • Proposition 6.2: Calibration Bound for Weighted Risk
  • ...and 20 more