Learning with Selectively Labeled Data from Multiple Decision-makers

Jian Chen; Zhehao Li; Xiaojie Mao

TL;DR

This paper tackles classification with selectively labeled data arising from historical decisions made by multiple decision-makers. By casting the problem in an instrumental variable (IV) framework with multi-valued IVs, it derives conditions for exact identification of the latent classification risk and, when exact identification fails, tight partial identification bounds. It then introduces Unified Cost-sensitive Learning (UCL) to learn classifiers robust to selection bias under both the point-identified and partially identified regimes, using calibrated surrogate losses and cross-fitting for practical estimation. Theoretical guarantees are provided for identification and generalization, and extensive simulations indicate that PartialLearning, a robust partial-identification approach, achieves near-ideal performance even under strong selection bias. The work offers a principled, nonparametric route to learning when labels are missing not at random (MNAR) across heterogeneous decision-makers, with practical relevance for fair and reliable deployment in high-stakes decision systems.

Abstract

We study the problem of classification with selectively labeled data, whose distribution may differ from the full population due to historical decision-making. We exploit the fact that in many applications historical decisions were made by multiple decision-makers, each with different decision rules. We analyze this setup under a principled instrumental variable (IV) framework and rigorously study the identification of classification risk. We establish conditions for the exact identification of classification risk and derive tight partial identification bounds when exact identification fails. We further propose a unified cost-sensitive learning (UCL) approach to learn classifiers robust to selection bias in both identification settings. Finally, we theoretically and numerically validate the efficacy of our proposed method.
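
To make the setup in the abstract concrete, the following is a minimal toy simulation, not the paper's method. It assumes no covariates, three hypothetical decision-makers with made-up leniency levels who are assigned at random, and an unobserved factor `u` that drives both the decision and the true label. It compares the naive positive rate among labeled cases with standard worst-case (Manski-style) bounds that use the decision-maker as an instrument. These bounds are a generic illustration and are not claimed to match the paper's tight bounds or its UCL procedure.

```python
import random


def simulate(n=200_000, leniency=(0.2, 0.5, 0.8), seed=0):
    """Return (records, true_rate); each record is (z, d, y), y is None if unlabeled.

    All functional forms and leniency values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    records, positives = [], 0
    for _ in range(n):
        u = rng.random()                      # unobserved factor
        z = rng.randrange(len(leniency))      # randomly assigned decision-maker
        y_star = 1 if rng.random() < 0.2 + 0.6 * u else 0  # true label
        d = 1 if u > 1.0 - leniency[z] else 0  # decision depends on u and on z
        records.append((z, d, y_star if d == 1 else None))  # label seen only if d == 1
        positives += y_star
    return records, positives / n


def naive_rate(records):
    """Positive rate among labeled cases only (ignores selection bias)."""
    labeled = [y for _, d, y in records if d == 1]
    return sum(labeled) / len(labeled)


def iv_bounds(records):
    """Worst-case bounds on P(Y* = 1), intersected across decision-makers."""
    stats = {}
    for z, d, y in records:
        count, labeled_pos, unlabeled = stats.get(z, (0, 0, 0))
        stats[z] = (count + 1, labeled_pos + (y if d == 1 else 0), unlabeled + (1 - d))
    # Lower bound: treat every unlabeled case as negative.
    lowers = [pos / cnt for cnt, pos, _ in stats.values()]
    # Upper bound: treat every unlabeled case as positive.
    uppers = [(pos + unl) / cnt for cnt, pos, unl in stats.values()]
    return max(lowers), min(uppers)


if __name__ == "__main__":
    records, truth = simulate()
    lower, upper = iv_bounds(records)
    print(f"true rate          : {truth:.3f}")
    print(f"naive (labeled)    : {naive_rate(records):.3f}")
    print(f"IV bounds          : [{lower:.3f}, {upper:.3f}]")
```

In this sketch the naive estimate overstates the true rate because only cases with a high unobserved `u` receive labels. Intersecting the per-decision-maker bounds narrows the interval because the more lenient decision-makers leave fewer labels missing.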

Paper Structure (37 sections, 14 theorems, 93 equations, 7 figures, 1 table, 1 algorithm)

Introduction
Related Work
Problem Formulation
Learning under Selective Labels
Multiple Decision-makers and IV
Exact Identification of Classification Risk
Partial Identification of Classification Risk
Unified Cost-sensitive Learning
Calibrated Surrogate Risk
Empirical Risk Minimization
Numeric Experiments
Conclusion
Supplements for Point Identification
Proofs for Exact Identification of Classification Risk
Sufficient Condition for Point Identification
...and 22 more sections

Key Result

Lemma 4.1

The excess risk $\mathcal{R}(t, s^{\star})$ satisfies

Figures (7)

Figure 1: Causal graph of the selective label problem. The dashed nodes $U$ and $Y^\star$ are unobserved. The dashed line from $X$ to $Z$ means that $Z$ is allowed, but not required, to be affected by $X$.
Figure 2: Testing accuracy of the methods with $\alpha \in \{0.5, 0.7, 0.9\}$ under Model NUCEM on the FICO dataset.
Figure 3: Testing accuracy of the methods with $\alpha \in \{0.5, 0.7, 0.9\}$ under Model UC on the FICO dataset.
Figure 4: Testing accuracy of the methods with $\alpha_Y \in \{0.3, 0.5, 0.7\}$ and $\alpha_D \in \{0.3, 0.5, 0.7\}$ under Model NUCEM on the synthetic dataset.
Figure 5: Testing accuracy of the methods with $\alpha_Y \in \{0.3, 0.5, 0.7\}$ and $\alpha_D \in \{0.3, 0.5, 0.7\}$ under Model UC on the synthetic dataset.
...and 2 more figures

Theorems & Definitions (30)

Lemma 4.1
Theorem 4.3: Point Identification
Theorem 4.4
Example 1: Full Information
Example 2: Separable and Independent Unobservables
Example 3: Additive Decision Probability
Theorem 5.2
Theorem 5.3
Theorem 6.1: Calibration Bound for Cost-sensitive Risk
Proposition 6.2: Calibration Bound for Weighted Risk
...and 20 more
