Generalization Error Bounds for Learning under Censored Feedback

Yifan Yang; Ali Payani; Parinaz Naghizadeh

TL;DR

The paper addresses generalization guarantees for learning under censored feedback, where true labels are revealed only for favorable decisions, so the collected training data is non-IID. By decomposing the data into IID blocks corresponding to the censored and disclosed regions, the authors extend the DKW inequality to censored settings with and without exploration, and link the resulting CDF estimation errors to a classifier's generalization performance. They derive ex-post and ex-ante bounds, analyze when exploration tightens the guarantees despite its costs, and validate the results with numerical simulations and real-world datasets, showing that standard IID-based bounds fail to capture the true guarantees under censoring. The work offers guidance on balancing data-collection costs against learning guarantees, and lays groundwork for distribution-aware and potentially distribution-free extensions in higher dimensions.

Abstract

Generalization error bounds from learning theory provide statistical guarantees on how well an algorithm will perform on previously unseen data. In this paper, we characterize the impacts of data non-IIDness due to censored feedback (a.k.a. selective labeling bias) on such bounds. Censored feedback is ubiquitous in many real-world online selection and classification tasks (e.g., hiring, lending, recommendation systems) where the true label of a data point is only revealed if a favorable decision is made (e.g., accepting a candidate, approving a loan, displaying an ad), and remains unknown otherwise. We first derive an extension of the well-known Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, which characterizes the gap between empirical and theoretical data distribution CDFs learned from IID data, to problems with non-IID data due to censored feedback. We then use this CDF error bound to provide a bound on the generalization error guarantees of a classifier trained on such non-IID data. We show that existing generalization error bounds (which do not account for censored feedback) fail to correctly capture the model's generalization guarantees, verifying the need for our bounds. We further analyze the effectiveness of (pure and bounded) exploration techniques, proposed by recent literature as a way to alleviate censored feedback, on improving our error bounds. Together, our findings illustrate how a decision maker should account for the trade-off between strengthening the generalization guarantees of an algorithm and the costs incurred in data collection when future data availability is limited by censored feedback.
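A minimal sketch of how censored feedback arises, assuming a hypothetical one-dimensional score drawn from $N(7,1)$, a fixed acceptance threshold `theta`, and an invented label model. The function name and the label probabilities are illustrative assumptions, not taken from the paper:

```python
import random

def simulate_censored_feedback(n_rounds, theta, seed=0):
    """Return (score, label) pairs; label is None when the applicant is rejected."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_rounds):
        score = rng.gauss(7.0, 1.0)  # assumed score distribution
        # Assumed label model: higher scores are more likely to be qualified.
        qualified = rng.random() < (0.8 if score >= 7.0 else 0.3)
        accepted = score >= theta    # threshold decision rule
        # The true label is observed only after a favorable decision.
        records.append((score, qualified if accepted else None))
    return records

records = simulate_censored_feedback(n_rounds=1000, theta=7.0)
labeled = [(s, y) for s, y in records if y is not None]

mean_all = sum(s for s, _ in records) / len(records)
mean_labeled = sum(s for s, _ in labeled) / len(labeled)
print(f"labeled {len(labeled)} of {len(records)} samples")
print(f"mean score, all samples:     {mean_all:.2f}")
print(f"mean score, labeled samples: {mean_labeled:.2f}")
```

Every labeled sample lies at or above `theta`, so the labeled set is not an IID draw from the full population. This is the non-IIDness that the bounds have to account for.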

Paper Structure (32 sections, 17 theorems, 44 equations, 14 figures, 1 table)

Introduction
Our approach
Summary of findings and contributions
Related Works
Problem Setting
Error Bounds on Cumulative Distribution Function Estimates (Ex-post Analysis)
CDF bounds under censored feedback
CDF bounds under censored feedback and exploration
When will exploration improve generalization guarantees?
How to choose an exploration strategy?
Generalization Error Bounds under Censored Feedback (Ex-post Analysis)
Numerical Experiments
CDF error bounds
Model generalization error bounds: real-world data
Comparison with existing generalization error bounds
...and 17 more sections

Key Result

Theorem 1

Let $Z_1, \ldots, Z_n$ be IID real-valued random variables with cumulative distribution function $F(z) = \mathbb{P}(Z_1\leq z)$. Let the empirical distribution function be $F_n(z) = \frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}(Z_i\leq z)$. Then, for every $n$ and $\eta > 0$,
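
For reference, the classical DKW inequality with Massart's tight constant is usually stated in the form below. This is the standard textbook statement, given here as an assumption about how the theorem concludes rather than a quotation from the paper:

$$\mathbb{P}\Big(\sup_{z\in\mathbb{R}} |F_n(z) - F(z)| > \eta\Big) \leq 2e^{-2n\eta^2}.$$

Setting the right-hand side equal to $\delta$ gives the equivalent confidence-band form: with probability at least $1-\delta$, $\sup_{z} |F_n(z) - F(z)| \leq \sqrt{\ln(2/\delta)/(2n)}$.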

Figures (14)

Figure 1: Illustration of the difference between the ex-post and ex-ante analyses.
Figure 2: The empirical CDFs $F_{n+k}$ (Full domain), $G_m$ (Censored region), and $K_{n-m+k}$ (Disclosed region), and the theoretical CDFs of $F$, $G,$ and $K$. Experiments based on randomly drawn samples from Gaussian data $N(7,1)$, $\theta=7$, $n=50$, $m=24$, and $k=0$.
Figure 3: Behavior of the censored and disclosed region error terms.
Figure 4: The empirical CDFs $F_{n+k_e+k_d}$ (Full domain), $G_l$ (Censored region), $E_{m-l+k_e}$ (Explored region), and $K_{n-m+k_d}$ (Disclosed region), and the theoretical CDFs of $F, G, E,$ and $K$. Experiments based on randomly drawn samples from Gaussian data $N(7,1)$, $\theta=7$, $LB = 6$, $n=50$, $l=7, m=27$, and $k_e=k_d=0$.
Figure 5: A minimum exploration frequency is needed to tighten the CDF error bound.
...and 9 more figures
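
The setup in the Figure 2 caption (Gaussian $N(7,1)$ data, $\theta=7$, $n=50$, $k=0$) can be sketched as follows. This is a rough illustration that splits one random draw at the threshold and applies the classical DKW band to each block separately; it does not reproduce the paper's censored-feedback bounds, and the helper names `empirical_cdf` and `dkw_radius`, the seed, and the confidence level `delta` are assumptions:

```python
import math
import random
from bisect import bisect_right

def empirical_cdf(samples):
    """Return a function z -> fraction of samples <= z."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    size = len(ordered)
    return lambda z: bisect_right(ordered, z) / size

def dkw_radius(size, delta):
    """Half-width eta solving 2 * exp(-2 * size * eta**2) = delta."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * size))

rng = random.Random(0)          # assumed seed, for reproducibility only
theta, n = 7.0, 50
data = [rng.gauss(7.0, 1.0) for _ in range(n)]

# Split the initial sample at the decision threshold.
censored = [z for z in data if z < theta]    # region that receives no new labels
disclosed = [z for z in data if z >= theta]  # region where labels keep arriving
m = len(censored)

F_n = empirical_cdf(data)        # full domain
G_m = empirical_cdf(censored)    # censored region
K_nm = empirical_cdf(disclosed)  # disclosed region (k = 0 extra samples)

delta = 0.05                     # assumed confidence level
for name, size in [("full", n), ("censored", m), ("disclosed", n - m)]:
    print(f"{name:10s} size={size:3d} DKW half-width={dkw_radius(size, delta):.3f}")
```

Because each block holds fewer samples than the full data set, its DKW half-width is wider. With censoring, the censored block stops growing, which is why its error term does not shrink as more data arrives.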

Theorems & Definitions (28)

Remark 1
Theorem 1: The DKW inequality (Dvoretzky et al., 1956; Massart, 1990)
Lemma 1: Censored Region
Lemma 2: Disclosed Region
Theorem 2
Corollary 1
Lemma 3: Exploration Region
Theorem 3
Proposition 1
Proposition 2
...and 18 more
