Generating the Ground Truth: Synthetic Data for Soft Label and Label Noise Research

Sjoerd de Vries; Dirk Thierens

TL;DR

SYNLABEL is a framework that addresses limitations of existing label noise research by creating noiseless datasets informed by real-world data. The paper demonstrates that SYNLABEL can precisely quantify label noise and that it improves on existing methodologies.

Abstract

In many real-world classification tasks, label noise is an unavoidable issue that adversely affects the generalization error of machine learning models. Additionally, evaluating how methods handle such noise is complicated, as the effect label noise has on their performance cannot be accurately quantified without clean labels. Existing research on label noise typically relies on either noisy or oversimplified simulated data as a baseline, into which additional noise with known properties is injected. In this paper, we introduce SYNLABEL, a framework designed to address these limitations by creating noiseless datasets informed by real-world data. SYNLABEL supports defining a pre-specified or learned function as the ground truth function, which can then be used for generating new clean labels. Furthermore, by repeatedly resampling values for selected features within the domain of the function, evaluating the function and aggregating the resulting labels, each data point can be assigned a soft label or label distribution. These distributions capture the inherent uncertainty present in many real-world datasets and enable the direct injection and quantification of label noise. The generated datasets serve as a clean baseline of adjustable complexity, into which various types of noise can be introduced. Additionally, they facilitate research into soft label learning and related applications. We demonstrate the application of SYNLABEL, showcasing its ability to precisely quantify label noise and its improvement over existing methodologies.
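The resampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `ground_truth`, `feature_hiding_soft_labels`, and all parameter names are hypothetical, the ground truth function is a hand-written rule rather than a learned model, and hidden features are resampled from their empirical marginal as a simple stand-in for a proper density estimate.

```python
import random
from collections import Counter


def ground_truth(x):
    """Hypothetical ground truth function f mapping a feature vector to a class.

    Assumed rule for illustration only: three classes split on x[0] + x[1].
    """
    s = x[0] + x[1]
    if s < 0.8:
        return 0
    if s < 1.2:
        return 1
    return 2


def feature_hiding_soft_labels(X, f, hidden, n_classes, n_samples=100, seed=0):
    """Assign each row a label distribution by resampling its hidden features.

    Values for the hidden features are drawn from their empirical marginal
    in X (an assumption made here for simplicity), f is evaluated on each
    resampled point, and the resulting labels are aggregated into frequencies.
    """
    rng = random.Random(seed)
    marginals = {j: [row[j] for row in X] for j in hidden}
    soft_labels = []
    for row in X:
        counts = Counter()
        for _ in range(n_samples):
            x = list(row)
            for j in hidden:
                x[j] = rng.choice(marginals[j])
            counts[f(x)] += 1
        soft_labels.append([counts[c] / n_samples for c in range(n_classes)])
    return soft_labels


# Toy data: 200 points with two features in [0, 1).
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]

y_hard = [ground_truth(x) for x in X]  # clean hard labels
y_soft = feature_hiding_soft_labels(X, ground_truth, hidden=[1], n_classes=3)
```

Hiding no features (`hidden=[]`) reproduces the clean hard labels as one-hot distributions, while hiding more features generally spreads probability mass over more classes.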

Paper Structure (12 sections, 2 equations, 4 figures, 1 table)

Introduction
Related Work
The SYNLABEL Framework
Dataset Types
Data Transformations
Down the chain
From Ground Truth to Partial Ground Truth
Back up
Application of the Framework
Uncertainty by Feature Hiding
Quantifying Label Noise
Conclusion

Figures (4)

Figure 1: A schematic overview of the SYNLABEL framework. The white boxes represent data, either input $X$ or labels $y$. The gray boxes represent a type of dataset, linked by a solid line to their input and output. The arrows represent the different transformations and functions defined by the framework. $\sim$: sampled. $f$: a function.
Figure 2: The level of label noise generated by Feature Hiding as measured by the mean entropy of the resulting soft labels for different probability density estimation methods and different numbers of features hidden. Average over 50 runs, with 100 values resampled for each feature. KDE: Kernel Density Estimation. MICE: Multivariate Imputation by Chained Equations.
Figure 3: Different methods for introducing uncertainty. (a) The Ground Truth dataset $D^G$ generated from Keel Vehicle, using a Random Forest Classifier. In the remainder of the image: The left column (b,e,h) shows the result of applying Feature Hiding to the ground truth dataset. The middle column (c,f,i) shows the result of applying a uniform noise matrix to $D^G$. The right column (d,g,j) shows the result of applying a random noise matrix to $D^G$. The top row (b,c,d) shows the results for a level of uncertainty of $\bar{TVD}(y^{OS}, y^{G}) = 0.17$, the middle row (e,f,g) for $\bar{TVD}(y^{OS}, y^{G}) = 0.45$, and the bottom row (h,i,j) the result of sampling a hard label from the soft labels from the middle row to obtain $D^{OH}$.
Figure 4: Different noise measures for varying noise rates. Left: the mean $TVD$. Feature Hiding was done by sampling from a marginal distribution constructed via Kernel Density Estimation (KDE). Uniform noise (NCAR) was added by applying noise matrix $T_r$. Right: the mean entropy. Feature Hiding was done by sampling from a conditional distribution using MICE. Random class-conditional noise (NAR) was introduced by a randomly generated $T_r$, with equal probabilities on the main diagonal. $T_r$: transition matrix. ID: instance-dependent (NNAR). $FH$: Feature Hiding. $\Delta_1$: noise introduced by $FH$. $\Delta_2$: noise introduced by applying $T_r$ to $D^G$. $\Delta_3$: noise introduced by applying $T_r$ to $D^{PG}$.
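The noise measures named in the captions above (mean TVD, mean entropy) and the application of a uniform noise matrix $T_r$ can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: all function names are hypothetical, TVD is taken to be the standard total variation distance with a factor of one half, entropy is computed in nats, and the uniform matrix is assumed to spread the noise rate evenly over the other classes.

```python
import math


def tvd(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(a * math.log(a) for a in p if a > 0)


def uniform_noise_matrix(n_classes, rate):
    """Row-stochastic matrix: keep a label with probability 1 - rate,
    otherwise flip uniformly to one of the other classes (assumed form)."""
    off_diagonal = rate / (n_classes - 1)
    return [
        [1 - rate if i == j else off_diagonal for j in range(n_classes)]
        for i in range(n_classes)
    ]


def apply_noise(y, T):
    """Push a label distribution through T: y_noisy[j] = sum_i y[i] * T[i][j]."""
    k = len(T)
    return [sum(y[i] * T[i][j] for i in range(k)) for j in range(k)]


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Clean one-hot labels for three classes, then noise at rate 0.3.
y_clean = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T = uniform_noise_matrix(3, rate=0.3)
y_noisy = [apply_noise(y, T) for y in y_clean]

mean_tvd = mean(tvd(a, b) for a, b in zip(y_noisy, y_clean))
mean_entropy = mean(entropy(y) for y in y_noisy)
```

For one-hot clean labels under this uniform matrix, the mean TVD equals the noise rate, which makes the measure easy to sanity-check. For soft clean labels the two quantities generally differ.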

Theorems & Definitions (5)

Definition 1
Definition 2
Definition 3
Definition 4
Definition 5
