Label Noise Cleaning for Supervised Classification via Bernoulli Random Sampling

Yuxin Liu; Xiong Jin; Yang Han

Abstract

Label noise, i.e., incorrect labels assigned to observations, can substantially degrade the performance of supervised classifiers. This paper proposes a label noise cleaning method based on Bernoulli random sampling. We show that the mean label noise levels of subsets generated by Bernoulli random sampling containing a given observation are identically distributed for all clean observations, and identically distributed, with a different distribution, for all noisy observations. Although the mean label noise levels are not independent across observations, by introducing an independent coupling we further prove that they converge to a mixture of two well-separated distributions corresponding to clean and noisy observations. By establishing a linear model between cross-validated classification errors and label noise levels, we approximate this mixture distribution and thereby separate clean and noisy observations without any prior label information. The proposed method is classifier-agnostic, theoretically justified, and demonstrates strong performance on both simulated and real datasets.
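The sampling idea in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation: the function name `mean_noise_levels`, the toy sizes, and the use of known noise flags are all assumptions made for illustration (in practice the flags are unknown, which is why the paper works with cross-validated errors instead). It draws Bernoulli random subsets and, for each observation, averages the noise level of the subsets that contain it.

```python
import numpy as np


def mean_noise_levels(is_noisy, q, n_subsets, seed=0):
    """For each observation, average the noise level of the random subsets containing it."""
    rng = np.random.default_rng(seed)
    noisy = np.asarray(is_noisy, dtype=float)
    n = noisy.size
    # Bernoulli random sampling: each observation joins each subset
    # independently with probability q.
    masks = (rng.random((n_subsets, n)) < q).astype(float)
    sizes = masks.sum(axis=1)
    # Noise level of a subset = fraction of its members that are noisy
    # (set to 0 for an empty subset, which contains no observation anyway).
    subset_levels = (masks @ noisy) / np.maximum(sizes, 1.0)
    # Average the subset noise levels over the subsets containing each observation.
    hits = masks.sum(axis=0)
    return (masks.T @ subset_levels) / np.maximum(hits, 1.0)


# Toy configuration (hypothetical, much smaller than the settings in the paper).
n_clean, n_noisy, q = 80, 20, 0.4
is_noisy = np.array([False] * n_clean + [True] * n_noisy)
levels = mean_noise_levels(is_noisy, q, n_subsets=20_000)

gap = levels[is_noisy].mean() - levels[~is_noisy].mean()
print(f"mean level (clean): {levels[~is_noisy].mean():.4f}")
print(f"mean level (noisy): {levels[is_noisy].mean():.4f}")
print(f"gap: {gap:.4f}")
```

With these toy settings the two groups should separate, with noisy observations showing a slightly higher mean level than clean ones. The gap is small in absolute terms, which is consistent with the need for a large number of subsets in the paper's own figures.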

Paper Structure (14 sections, 7 theorems, 43 equations, 10 figures, 4 tables, 1 algorithm)

Introduction
Problem setup and theoretical basis for separation
Dataset with label noise, supervised classification, and assumptions
A positively quadrant dependent (PQD) sequence of random variables $\{l_{\bm{z}}\}$ and its separated empirical distribution
An independent coupling and the approximate empirical distribution of $\{l_{\bm{z}}\}$
Separation of clean and noisy observations using averaged cross-validation supervised classification error
Linear modeling of $\{l_{\bm{z}}\}$ via averaged cross-validation error
Cut point selection for separating clean and noisy observations
The stepwise label noise cleaning algorithm (BRSLC)
Experiments
Experiments on simulated data with artificial label noise
Experiments on real-world data with artificial label noise
Experiments on real-world data without artificial label noise
Conclusions

Key Result

Lemma 2.1

The distributions $l_1$ and $l_2$ are given by [equations not preserved in this extract], where $T_1 \sim \mathrm{Bin}(N_1 - 1, q)$, $T_2 \sim \mathrm{Bin}(N_2 - 1, q)$, and $\xi \sim \mathrm{Be}(q)$, with $\xi$, $T_1$, and $T_2$ mutually independent. Let $\mu_1 = \mathbb{E}[L_1]$ and $\mu_2 = \mathbb{E}[L_2]$ denote the expectations of $L_1$ and $L_2$; then [equation not preserved in this extract] and [equation not preserved in this extract].

Figures (10)

Figure 1: Empirical distribution of $\{\tilde{l}_{\bm{z}}\}_{\bm{z} \in D}$ obtained from a dataset with $N = 1000$ observations ($N_1 = 800$ clean and $N_2 = 200$ noisy). The values of $\tilde{l}_{\bm{z}}$ are computed under the Bernoulli random sampling scheme with $M = 2 \times 10^6$ subsets and inclusion probability $q = 0.4$. The histogram shows two well-separated laws corresponding to $L_1$ and $L_2$, with a mean gap $\mathbb{E}\!\left[\tfrac{1 - q}{(N - 1)q + 1}\right] \approx 0.0015$.
Figure 2: Empirical distribution of the i.i.d. sequence $\{\tilde{l}'_{\bm{z}}\}_{\bm{z} \in D}$ generated under the same settings as Figure \ref{fsupp1}. The mixture shows two components, approximately corresponding to $L_1$ and $L_2$, with a mean gap of $\tfrac{1 - q}{(N - 1)q + 1} \approx 0.0015$, consistent with Lemma \ref{lemma3}.
Figure 3: Linear model between $\tilde{l}_{\bm{z}}$ and $\tilde{e}_{\bm{z}}$ under Setting 1 with 20% label noise. Results are shown for classifiers trained with RBF SVM and 1-NN, as described in Section \ref{sec4}.
Figure 4: Comparison between the empirical and model-estimated distributions of $\tilde{l}_{\bm{z}}$. The estimated distribution $\hat{l}_{\bm{z}}$ closely matches the empirical $\tilde{l}_{\bm{z}}$, and the estimated cut point $\hat{l}^{*}$ aligns well with the true $l^{*}$.
Figure 5: Three-iteration BRSLC performance of RBF SVM and 1-NN. The legend entry $< e^{*}$ denotes $\frac{\#\{\bm{z}\in D_1\mid \tilde{e}_{\bm{z}} < e^{*}\}}{\#\{\bm{z}\in D\mid \tilde{e}_{\bm{z}} < e^{*}\}}$, and $\ge e^{*}$ denotes $\frac{\#\{\bm{z}\in D_2\mid \tilde{e}_{\bm{z}} \ge e^{*}\}}{\#\{\bm{z}\in D\mid \tilde{e}_{\bm{z}} \ge e^{*}\}}$.
...and 5 more figures
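
The mean gap quoted in the captions of Figures 1 and 2 can be checked with a line of arithmetic. The helper name `mean_gap` below is hypothetical; it simply evaluates the expression $\tfrac{1 - q}{(N - 1)q + 1}$ from the captions at the stated settings.

```python
def mean_gap(n, q):
    """Evaluate (1 - q) / ((n - 1) * q + 1), the gap expression quoted in the captions."""
    return (1.0 - q) / ((n - 1) * q + 1.0)


# Settings from Figure 1: N = 1000 observations, inclusion probability q = 0.4.
gap = mean_gap(1000, 0.4)
print(round(gap, 4))  # 0.0015
```

The gap shrinks roughly like $1/(Nq)$, so larger datasets need more subsets before the two components become visibly distinct.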

Theorems & Definitions (12)

Definition 2.1
Definition 2.2: Label Noise Level
Lemma 2.1
Definition 2.3: PQD Random Variables
Lemma 2.2
Lemma 2.3
Lemma 2.4
Corollary 2.1
Definition 3.1
Lemma 3.1
...and 2 more
