Generalization Error Bounds for Learning under Censored Feedback

Yifan Yang, Ali Payani, Parinaz Naghizadeh

TL;DR

The paper addresses generalization guarantees for learning under censored feedback, where true labels are revealed only for favorable decisions, causing non-IID data. By decomposing the data into IID blocks corresponding to censored and disclosed regions, the authors extend the DKW inequality to censored settings with and without exploration, and link these CDF estimation errors to classifier generalization performance. They derive ex-post and ex-ante bounds, analyze how exploration can tighten guarantees despite costs, and validate the results with numerical simulations and real-world datasets, showing that standard IID-based bounds fail to capture the true guarantees under censoring. The work provides actionable guidance for balancing data-collection costs and learning guarantees, and lays groundwork for distribution-aware and potentially distribution-free extensions in higher dimensions.

Abstract

Generalization error bounds from learning theory provide statistical guarantees on how well an algorithm will perform on previously unseen data. In this paper, we characterize the impacts of data non-IIDness due to censored feedback (a.k.a. selective labeling bias) on such bounds. Censored feedback is ubiquitous in many real-world online selection and classification tasks (e.g., hiring, lending, recommendation systems) where the true label of a data point is only revealed if a favorable decision is made (e.g., accepting a candidate, approving a loan, displaying an ad), and remains unknown otherwise. We first derive an extension of the well-known Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, which characterizes the gap between empirical and theoretical data distribution CDFs learned from IID data, to problems with non-IID data due to censored feedback. We then use this CDF error bound to provide a bound on the generalization error guarantees of a classifier trained on such non-IID data. We show that existing generalization error bounds (which do not account for censored feedback) fail to correctly capture the model's generalization guarantees, verifying the need for our bounds. We further analyze the effectiveness of (pure and bounded) exploration techniques, proposed by recent literature as a way to alleviate censored feedback, on improving our error bounds. Together, our findings illustrate how a decision maker should account for the trade-off between strengthening the generalization guarantees of an algorithm and the costs incurred in data collection when future data availability is limited by censored feedback.

Paper Structure (32 sections, 17 theorems, 44 equations, 14 figures, 1 table)

This paper contains 32 sections, 17 theorems, 44 equations, 14 figures, 1 table.

Key Result

Theorem 1

Let $Z_1, \ldots, Z_n$ be IID real-valued random variables with cumulative distribution function $F(z) = \mathbb{P}(Z_1 \leq z)$. Let the empirical distribution function be $F_n(z) = \frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}(Z_i \leq z)$. Then, for every $n$ and $\eta > 0$,

$$\mathbb{P}\Big(\sup_{z \in \mathbb{R}} |F_n(z) - F(z)| \geq \eta\Big) \leq 2\exp(-2n\eta^2).$$
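
As a rough numerical illustration of the IID setting in Theorem 1, the sketch below compares the Monte Carlo frequency of the event $\sup_z |F_n(z) - F(z)| > \eta$ against the standard DKW bound $2\exp(-2n\eta^2)$. The function names (`sup_gap`, `dkw_bound`, `violation_rate`), the Gaussian $N(7,1)$ test distribution, and the trial count are assumptions chosen for illustration; they are not taken from the paper's experiments.

```python
import numpy as np
from math import erf, exp, sqrt


def normal_cdf(z, mu=0.0, sigma=1.0):
    """CDF of a Gaussian N(mu, sigma^2) evaluated at scalar z."""
    return 0.5 * (1.0 + erf((z - mu) / (sigma * sqrt(2.0))))


def sup_gap(samples, cdf):
    """Exact sup_z |F_n(z) - F(z)| for a continuous CDF F.

    The supremum is attained at (or just before) a jump of F_n,
    so it suffices to check both sides of every sorted sample.
    """
    z = np.sort(np.asarray(samples, dtype=float))
    n = len(z)
    f = np.array([cdf(v) for v in z])
    upper = np.arange(1, n + 1) / n - f  # F_n at each jump minus F
    lower = f - np.arange(0, n) / n      # F minus F_n just before each jump
    return float(max(upper.max(), lower.max()))


def dkw_bound(n, eta):
    """Right-hand side of the DKW inequality, capped at 1."""
    return min(1.0, 2.0 * exp(-2.0 * n * eta ** 2))


def violation_rate(n, eta, trials=2000, seed=0):
    """Monte Carlo estimate of P(sup_z |F_n - F| > eta) for IID N(7, 1) data."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        s = rng.normal(7.0, 1.0, size=n)
        if sup_gap(s, lambda v: normal_cdf(v, 7.0, 1.0)) > eta:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    n, eta = 50, 0.2
    print("empirical violation rate:", violation_rate(n, eta))
    print("DKW bound:               ", dkw_bound(n, eta))
```

Up to Monte Carlo noise, the empirical rate should stay at or below the bound. This check covers only the IID case; it says nothing about data collected under censoring.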

Figures (14)

  • Figure 1: Illustration of the difference between the ex-post and ex-ante analysis cases.
  • Figure 2: The empirical CDFs $F_{n+k}$ (Full domain), $G_m$ (Censored region), and $K_{n-m+k}$ (Disclosed region), and the theoretical CDFs $F$, $G$, and $K$. Experiments based on randomly drawn samples from Gaussian data $N(7,1)$, $\theta=7$, $n=50$, $m=24$, and $k=0$.
  • Figure 3: Behavior of the censored and disclosed region error terms.
  • Figure 4: The empirical CDFs $F_{n+k_e+k_d}$ (Full domain), $G_l$ (Censored region), $E_{m-l+k_e}$ (Explored region), and $K_{n-m+k_d}$ (Disclosed region), and the theoretical CDFs $F$, $G$, $E$, and $K$. Experiments based on randomly drawn samples from Gaussian data $N(7,1)$, $\theta=7$, $LB=6$, $n=50$, $l=7$, $m=27$, and $k_e=k_d=0$.
  • Figure 5: A minimum exploration frequency is needed to tighten the CDF error bound.
  • ...and 9 more figures
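
The censored/disclosed decomposition described in the Figure 2 caption can be simulated with a short sketch. It draws $n$ fully observed initial samples from $N(7,1)$, then admits $k$ further samples only when they fall at or above the threshold $\theta$, mimicking censored feedback. It compares a naive pooled empirical CDF with a block-wise estimate that reweights the censored and disclosed regions by $\hat{\alpha} = m/n$, an estimate of $F(\theta)$ from the initial data. The function names, the acceptance rule `z >= theta`, and the reweighting formula are illustrative assumptions based on the identity $F(z) = \alpha G(z)$ for $z < \theta$ and $F(z) = \alpha + (1-\alpha)K(z)$ otherwise; this is not presented as the authors' exact estimator.

```python
import numpy as np
from math import erf, sqrt


def normal_cdf(z, mu=7.0, sigma=1.0):
    """True CDF F of the N(mu, sigma^2) data distribution."""
    return 0.5 * (1.0 + erf((z - mu) / (sigma * sqrt(2.0))))


def ecdf(samples, z):
    """Empirical CDF of `samples` at scalar z (0.0 if there are no samples)."""
    samples = np.asarray(samples, dtype=float)
    return float(np.mean(samples <= z)) if samples.size else 0.0


def simulate(n=50, k=200, theta=7.0, seed=0):
    """Draw n initial samples, then k more that are observed only if z >= theta."""
    rng = np.random.default_rng(seed)
    initial = rng.normal(7.0, 1.0, size=n)   # fully observed initial data
    censored = initial[initial < theta]      # m samples below the threshold
    later = []
    while len(later) < k:                    # rejected arrivals are never observed
        z = rng.normal(7.0, 1.0)
        if z >= theta:
            later.append(z)
    disclosed = np.concatenate([initial[initial >= theta], np.array(later)])
    return initial, censored, disclosed


def naive_cdf(censored, disclosed, z):
    """Pooled empirical CDF that ignores censoring (treats the data as IID)."""
    return ecdf(np.concatenate([censored, disclosed]), z)


def blockwise_cdf(initial, censored, disclosed, theta, z):
    """Recombine the two IID blocks using alpha_hat = m / n as an estimate of F(theta)."""
    alpha_hat = len(censored) / len(initial)
    if z < theta:
        return alpha_hat * ecdf(censored, z)
    return alpha_hat + (1.0 - alpha_hat) * ecdf(disclosed, z)


if __name__ == "__main__":
    theta, z = 7.0, 8.0
    initial, censored, disclosed = simulate(n=50, k=200, theta=theta)
    print("true F(z):     ", normal_cdf(z))
    print("naive pooled:  ", naive_cdf(censored, disclosed, z))
    print("block-wise:    ", blockwise_cdf(initial, censored, disclosed, theta, z))
```

With $k > 0$ the pooled estimate over-represents the disclosed region, so it drifts away from $F$ as more accepted samples arrive, while the block-wise estimate stays anchored by the initial samples.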

Theorems & Definitions (28)

  • Remark 1
  • Theorem 1: The DKW inequality (Dvoretzky et al., 1956; Massart, 1990)
  • Lemma 1: Censored Region
  • Lemma 2: Disclosed Region
  • Theorem 2
  • Corollary 1
  • Lemma 3: Exploration Region
  • Theorem 3
  • Proposition 1
  • Proposition 2
  • ...and 18 more